ABSTRACT


This report summarizes the research activities at the Information Systems Laboratory, Stanford
University, for the "Fast Algorithm for Improved Speech Coding and Recognition" project
during the past sixteen months. This research effort has studied estimation techniques for
processes that contain Gaussian noise and jump components, and classification methods for
transitional signals by using recursive estimation with vector quantization. The major
accomplishments presented are an algorithm for joint estimation of excitation and vocal tract
response, a pitch pulse location method using recursive least squares estimation, and a stop
consonant recognition method using recursive estimation and vector quantization.


M. Morf • Principal Investigator and J. Turner • Research Associate
W. Stirling, J. Shynk, and S-S. Huang
1.  INTRODUCTION


During the course of this research contract, estimation techniques for processes that contain
Gaussian noise and jump components, and classification methods for transitional signals by using
recursive estimation with vector quantization were studied. These signal processing tools have
possible application to a wide range of physical signals, although this research studied their use
for speech processing. The major accomplishments presented are an algorithm for joint
estimation of excitation and vocal tract response, a pitch pulse location method using recursive
least squares estimation, and a stop consonant recognition method using recursive estimation and
vector quantization.

JOINT  ESTIMATION  OF  EXCITATION  AND  VOCAL  TRACT  RESPONSE 

Historically, the development of estimation theory and signal modeling techniques has usually
presumed that the processes involved had Gaussian statistics. Most naturally occurring
processes tend to be Gaussian. However, many man-made signals have additional components
that can be characterized as harmonic structures, jump processes, or impulsive noise. For
example, the rotating blade in an aircraft generates an artifact when the blade crosses the wing;
likewise, the main rotor and tail rotor of helicopters produce signals depending on the orientation
of the fuselage. Underwater acoustical signals from man-made sources, radar/sonar returns
generated by pulsed sources, and in general any signal that has been processed in a nonlinear
fashion are within the class of non-Gaussian signals.

Estimation techniques were developed for signals composed of a Gaussian noise component
and a jump process component driving a linear system. In particular, simultaneous estimation of
the system parameters (ARMA) and the jump excitation was introduced. The technique evolved
from simple pulse-in-noise detection to composite pulses and noise from an ARMA structured
system. A decision-directed approach was used to estimate the unknown prior statistics of the
pulse process. A full description of these techniques was presented in the first ONR technical
report, M736-1, Feb. 1983.


In this study, the estimation technique was applied to speech signals in an attempt to improve
the estimate of pitch and vocal tract response. Most speech modeling techniques handle the
response and excitation separately. The semiperiodic opening of the vocal cords, which emits a
pulse of air to excite the vocal tract (throat, tongue, and mouth), provides an example of jump
and noise excitation that has been much studied. The complex interaction of the vocal cords,
vocal tract, and nose warrants simultaneous estimation of the response function and the excitation.

PITCH  DETECTION  BY  LEAST  SQUARES  LATTICE  ALGORITHM 

There are many advantages of recursive estimation techniques, particularly when implemented
in the form of a lattice filter. An overview of recursive least squares estimation and lattice
filters was presented in the second ONR technical report, M736-2, Jan. 1984. Within the
Least Squares Lattice algorithm, a "likelihood" variable is calculated which indicates the
occurrence of unexpected or non-Gaussian components in the signal. The derivative of this
variable multiplied by other signal parameters appears to be a good detector of pitch pulses in
speech. The development and experimental results of this pitch detection method are presented.

RESEARCH  ON  RECOGNITION  OF  STOP  CONSONANT 

Recursive Estimation and Vector Quantization have been two very active areas of research
in the last few years. Each area has developed new mathematical tools for analyzing and
characterizing signals. These techniques are trying to satisfy different objectives: adaptive signal
modeling or efficient signal quantization, respectively. However, there is a natural marriage of
these two powerful mathematical tools that often provides a more appropriate solution to
problems in signal modeling, coding, and classification.

For adaptive speech modeling, the time varying nature of speech requires that quickly
changing burst sounds as well as fairly steady vowel sounds be efficiently approximated. The
recursive orthogonalizing properties of the ladder structure allow speech transitions to be tracked
precisely while still yielding consistent parameters for steady sounds. Recursive estimation
generates a full signal model for each data sample, causing a considerable increase in the number
of parameters handled. For coding and transmission of signals, the recursive estimation generates
a good signal model but the problem of efficient parameter encoding remains. In coding or
classification applications, only a small number of 'states of the world' are of interest rather than
the continuum of parameter values generated by RLS.

Vector Quantization (VQ) design algorithms have been used to design low bit rate data
compression and data classification systems. For speech recognition, vector quantization
techniques have been developed for speaker dependent and independent word recognition. VQ is
well suited for data compression or data classification once the codewords have been determined
from a representative training data set.

Experiments on combining recursive estimation and vector quantization were begun in this
ONR contract. Using recursive estimation to track changing signal characteristics and vector
quantization to systematically classify the resulting parameters brings together adaptive
processing with limited state output. This idea was first applied to speech for recognition of
transitional sounds, which are currently very difficult to distinguish. This approach acknowledges
that speech contains only a finite number of identifiable sound units (in each language), but that
some sounds happen quite quickly. This type of classification technique distinguishes transitional
states in the signal that are themselves of interest.

A classification scheme using parameter trajectory information was developed that allows
transitional signal characteristics to be identified. The transitions in the data can be tracked
using recursive estimation rather than being coarsely approximated by LPC (or equivalent)
parameterizations from fixed speech windows. By having a signal model at every data sample,
the trajectory of the parameters can be readily determined. This new information assisted in
distinguishing transitional components from steady state components.

A classified vector quantization approach was also developed that allows quantization
precision to be specified for various signal components. No longer must the steady state signal
components dominate the vector quantized states. The results for recognizing stop consonants
within a limited test are very encouraging.
2.  JOINT ESTIMATION OF EXCITATION AND VOCAL TRACT RESPONSE

2.1  INTRODUCTION 


Many speech analysis techniques attempt to deconvolve the speech waveform into an
excitation component and a response function. The standard approach is to estimate the vocal
tract model parameters first and then the excitation signal from the residual errors (or directly
from a bandlimited version of the original speech signal). A new approach for simultaneously
identifying the system model parameters and detecting the unobserved random pulse-type inputs
has been developed. A key component of this procedure is the application of a new
decision-directed algorithm to estimate the period of the pitch pulse process. This
decision-directed algorithm incorporates an exact, recursive estimator to compute the rate of a
discrete-time point process used to characterize the arrival times of the pitch pulse process. An
overview of this approach is presented here. The complete description was contained in the first
ONR technical report, M736-1.

A common assumption of speech analysis is that a speech waveform can be modeled as the
output of a linear system driven by an approximately Gaussian noise part (for unvoiced speech)
plus a jump component (periodic pulses for voiced speech). Typically, it is assumed that the
linear system used to model the vocal tract consists of an all-pole filter (an autoregressive or AR
representation) whose coefficients are slowly time-varying. The all-pole model used to
characterize the vocal tract and the mixed driving process (a white Gaussian noise plus a pulse
process) admits the representation

$$ y_t + \sum_{i=1}^{p} a_i\, y_{t-i} = \sum_{i=1}^{q} b_i\, n_{t-i} + b\, n_t + v_t \tag{2.1} $$


where $\{y_t\}$ is the observed speech waveform, $\{n_t\}$ is a binary (0,1) sequence denoting the
epochs of the pitch pulses, $\{v_t\}$ denotes a WGN process, and $a_i$ and $b_i$ denote the model
coefficients. The estimation/detection problem is to simultaneously estimate these coefficients
and to detect the occurrence of the pulses (i.e., to detect the events $n_t = 1$). Standard least
squares techniques are used to estimate the $a_i$ coefficients. The unique aspects of this analysis
consist of the approach used to detect the events $n_t = 1$ and the joint estimation of the model
parameters and detection of the pulse input (excitation).
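To make the signal model concrete, the following minimal sketch simulates the single-sample-pulse case of (2.1) and then recovers the mixed driving process by inverse filtering with the true coefficients. The function names, the coefficient values, and the pulse positions are illustrative assumptions and do not come from the report.

```python
import numpy as np

def simulate_speech_model(a, b, pulse_times, n_samples, noise_std=1.0, seed=0):
    """Simulate y_t + sum_i a_i y_{t-i} = b n_t + v_t with zero initial conditions."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    n = np.zeros(n_samples, dtype=int)
    n[np.asarray(pulse_times, dtype=int)] = 1      # binary pulse epochs
    v = noise_std * rng.standard_normal(n_samples)  # white Gaussian noise
    y = np.zeros(n_samples)
    for t in range(n_samples):
        # autoregressive feedback from the available past outputs
        past = sum(a[i] * y[t - 1 - i] for i in range(len(a)) if t - 1 - i >= 0)
        y[t] = -past + b * n[t] + v[t]
    return y, n, v

def prediction_error(y, a):
    """Residual e_t = y_t + sum_i a_i y_{t-i} for known coefficients a."""
    y = np.asarray(y, dtype=float)
    a = np.asarray(a, dtype=float)
    e = y.copy()
    for i in range(len(a)):
        e[i + 1:] += a[i] * y[:len(y) - i - 1]
    return e

# Hypothetical example: a stable second-order all-pole filter with three pulses.
y, n, v = simulate_speech_model([-1.2, 0.6], 5.0, [10, 50, 90], 120)
e = prediction_error(y, [-1.2, 0.6])
```

With exact coefficients the residual equals the driving process `b * n + v`, which is why pulse detection can be posed on the prediction error.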

The estimation of the coefficients follows in a straightforward manner once the pulses
have been detected. The detection problem is rendered difficult by the absence of reliable
a priori information about the probability of the event $n_t = 1$. The problem of binary detection
with unknown priors leads to the application of so-called decision-directed (DD) detectors. DD
detectors [DS], [KD] use the results of the past decisions to estimate the rate (i.e., the a priori
probability) of the signal, which is used to adjust the parameters of the detector (assuming that
the previous decisions were correct). A method of dealing with nonstationary priors in a DD
algorithm was developed in [SM]. Specifically, the speech problem results in a pulse process that
is intermittent (present for voiced speech only), and, when present, is of a highly structured nature
(the pitch process exhibits a nearly periodic structure). An algorithm to simultaneously estimate
the vocal tract parameters and to detect and estimate the pitch pulse waveform is presented here.
2.2  DECISION-DIRECTED DETECTOR

Consider a discrete-time point process (DTPP) $\{n_t\}$ for $t = 1, 2, 3, \ldots$, such that

$$ \Pr(n_t = 1 \mid B_{t-1}) = 1 - \Pr(n_t = 0 \mid B_{t-1}) = \lambda_t \tag{2.2} $$

where $\lambda_t$ is the random rate of the process $\{n_t\}$ and $B_{t-1}$ is the sigma field generated
by all the factors that affect the probability of a pulse occurring at time $t-1$. To simplify the
development, assume that the effect of the jump is restricted to isolated time points, $b_i = 0$ for
$i > 0$. The prediction error process is $e_t$.

$$ e_t = y_t - \hat{A}_t^T Y_t \quad \text{where} \quad Y_t = [-y_{t-1}, \ldots, -y_{t-p}]^T \tag{2.3} $$

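The symbol $\hat{\ }$ denotes the least squares estimate of the vector $A$. The detection problem
is to decide between the two hypotheses $H_0$, noise only, and $H_1$, pulse plus noise.

$$ H_0:\ e_t = v_t \qquad\qquad H_1:\ e_t = b + v_t \tag{2.4} $$

The Bayes decision rule with respect to $\lambda_t$ is $N_t$.

$$ N_t = \begin{cases} 1, & \text{if } (1-\lambda_t)\, f(e_t \mid n_t = 0) < \lambda_t\, f(e_t \mid n_t = 1) \\ 0, & \text{otherwise} \end{cases} \tag{2.5} $$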

where $f(\cdot \mid n_t = 0)$ and $f(\cdot \mid n_t = 1)$ are the density functions of $e_t$ under
hypotheses $H_0$ and $H_1$, respectively. The output of the detector is the sequence $\{N_t\}$. Let
$\lambda'_t$ denote the rate of $N_t$. The philosophy of the DD approach is to estimate $\lambda'_t$,
and to use this estimate for subsequent operation. For the case where $v_t \sim N(0,1)$, the
likelihood ratio test (LRT) of (2.5) assumes the form

$$ N_t = \begin{cases} 1, & e_t > \tau(\hat{\lambda}'_t) \\ 0, & e_t < \tau(\hat{\lambda}'_t) \end{cases} \tag{2.6} $$

where $\tau(\lambda) = b/2 - [\log \lambda - \log(1-\lambda)]/b$ and $\hat{\lambda}'_t$ is an estimate of $\lambda'_t$.
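The threshold form of the test can be checked numerically. The sketch below assumes unit-variance Gaussian noise and a positive pulse amplitude $b$, and takes the threshold to be $\tau(\lambda) = b/2 - [\log\lambda - \log(1-\lambda)]/b$, which follows from taking logarithms of the Bayes rule (2.5) with Gaussian densities. The function names are illustrative, not from the report.

```python
import math

def dd_threshold(rate, b):
    """Threshold tau(lambda) for b > 0 and unit-variance Gaussian noise."""
    return b / 2.0 - (math.log(rate) - math.log(1.0 - rate)) / b

def lrt_decision(e_t, rate_estimate, b):
    """Threshold form of the detector: declare a pulse when e_t exceeds tau."""
    return 1 if e_t > dd_threshold(rate_estimate, b) else 0

def bayes_decision(e_t, rate, b):
    """Direct evaluation of the Bayes rule with N(0,1) and N(b,1) densities."""
    f0 = math.exp(-0.5 * e_t ** 2)          # density under H0 (common factor dropped)
    f1 = math.exp(-0.5 * (e_t - b) ** 2)    # density under H1 (same factor dropped)
    return 1 if (1.0 - rate) * f0 < rate * f1 else 0

# A higher estimated pulse rate lowers the threshold, and vice versa.
print(dd_threshold(0.1, 4.0), dd_threshold(0.5, 4.0), dd_threshold(0.9, 4.0))
```

The two decision functions agree everywhere except exactly at the threshold, which illustrates why the rate estimate acts directly as a threshold control.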

Suppose that the rate of $N_t$ can be modeled as a finite-state Markov chain with state vector
$p = [p_1, \ldots, p_m]^T$, where $p_1 < \cdots < p_m$, with transition probabilities given by

$$ \Pr(\lambda'_t = p_j \mid \lambda'_{t-1} = p_i) = q_{ij}(t) \tag{2.7} $$

with initial distribution $\pi = [\pi_1, \ldots, \pi_m]^T$ where $\pi_i = \Pr(\lambda'_0 = p_i)$.




Define $x_t = [x_1(t), \ldots, x_m(t)]^T$ by

$$ x_i(t) = \begin{cases} 1, & \text{if } \lambda'_t = p_i \\ 0, & \text{otherwise} \end{cases} \qquad i = 1, 2, \ldots, m $$

Thus, $\lambda'_t = p^T x_t$. This formulation was first introduced by Segall [Se]. The vector $x_t$
can be viewed as the state vector of a system obeying dynamics and observation equations of the
form

$$ x_{t+1} = Q_t^T x_t + u_t, \qquad N_t = p^T x_t + \epsilon_t \tag{2.9} $$

where $Q_t = \{q_{ij}(t)\}$. The processes $\{u_t\}$ and $\{\epsilon_t\}$ are Martingale Difference
sequences with respect to the family of sigma fields $\{B_t\}$ with
$B_t = \sigma\{N_1, \ldots, N_t, x_1, \ldots, x_{t+1}\}$.

A general estimator for this problem was developed in [St] for the case where the transition
matrix, $Q_t$, is not only time dependent, but is realization-dependent as well. Suppose the
transition matrix is conditioned on $F_{t-1}$ and admits the structure in (2.10).

$$ q_{ij}(t) = \begin{cases} s_{ij}(t, F_{t-1}), & \text{if } N_t = 1 \\ r_{ij}(t, F_{t-1}), & \text{if } N_t = 0 \end{cases} \tag{2.10} $$

The matrices $S_t = \{s_{ij}(t, F_{t-1})\}$ and $R_t = \{r_{ij}(t, F_{t-1})\}$ thus define the dynamics
of the Markov chain. Note that these matrices are conditioned on the past data, represented by
$F_{t-1}$. The estimator takes the form in (2.11).


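$$ \hat{x}_{t+1|t} = A_t\, \hat{x}_{t|t-1} + \frac{S_t^T \operatorname{diag}(p)\, \hat{x}_{t|t-1} - (p^T \hat{x}_{t|t-1})\, A_t\, \hat{x}_{t|t-1}}{p^T \hat{x}_{t|t-1} - (p^T \hat{x}_{t|t-1})^2} \left( N_t - p^T \hat{x}_{t|t-1} \right) \tag{2.11} $$

where

$$ A_t = R_t^T - (R_t - S_t)^T \operatorname{diag}(p) $$

The estimated rate is given by

$$ \hat{\lambda}'_{t+1|t} = p^T \hat{x}_{t+1|t} \tag{2.12} $$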

£|  —  *i+i|i  *j+i|» 
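The recursion (2.11) and the rate estimate (2.12) can be sketched in a few lines. The version below is an illustration under stated assumptions: the rows of $S_t$ and $R_t$ are taken to sum to one, the gain is written as $[S_t^T \operatorname{diag}(p)\hat{x} - (p^T\hat{x}) A_t \hat{x}] / [p^T\hat{x} - (p^T\hat{x})^2]$, and the numerical values are hypothetical. Under these assumptions the update reduces to the ordinary Bayes posterior followed by a one-step prediction, which the example verifies.

```python
import numpy as np

def rate_filter_step(x_hat, N_t, p, S, R):
    """One step of the recursive rate estimator.

    x_hat : predicted state probabilities x_{t|t-1} (non-negative, summing to 1)
    N_t   : detector output at time t (0 or 1)
    p     : vector of the Markov chain rates p_1 < ... < p_m
    S, R  : transition matrices applied when N_t = 1 and N_t = 0 (rows sum to 1)
    """
    D = np.diag(p)
    lam = p @ x_hat                        # predicted rate p^T x_{t|t-1}
    A = R.T - (R - S).T @ D                # A_t = R^T - (R - S)^T diag(p)
    gain = (S.T @ D @ x_hat - lam * (A @ x_hat)) / (lam - lam ** 2)
    x_next = A @ x_hat + gain * (N_t - lam)
    return x_next, p @ x_next              # new state estimate and estimated rate

# Hypothetical three-state example.
p = np.array([0.05, 0.3, 0.7])
S = np.array([[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.3, 0.7]])
R = np.array([[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]])
x = np.array([0.2, 0.5, 0.3])
x1, rate1 = rate_filter_step(x, 1, p, S, R)   # update after a detected pulse
x0, rate0 = rate_filter_step(x, 0, p, S, R)   # update after no detection
```

Writing the update in innovation form keeps a single expression for both outcomes of $N_t$, at the cost of dividing by $p^T\hat{x} - (p^T\hat{x})^2$, which must stay away from zero.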


2.3  APPLICATION TO PITCH DETECTION

Consider the waveform $\{e_t\}$ defined by (2.3), and suppose that $v_t \sim N(0,1)$, $b$ is a
constant, and $\{n_t\}$ is a DTPP that is pseudo-periodic, in the sense that, once a pulse occurs,
the probability of another pulse occurring soon is small, but increases as time progresses (an
example of such a process is the sequence of glottal pulses of voiced speech). Also, suppose that
the repetition interval (or pseudo-period) of this process may be changing (e.g., as the pitch period
is modulated, as with a singing voice). The near periodicity of the signal may, however, be
directly incorporated into the structure of the $Q_t$ matrix as introduced above. Define the
elements of $Q_t$ as follows:


[Equation (2.13), giving the case-by-case definition of the elements $q_{ij}(t)$ of $Q_t$, is not
recoverable from the source scan.]


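where $\delta \in (0,1)$ is a constant;

$$ I_{(l_t, a_t)}(t) = \begin{cases} 1, & \text{if } l_t < t < a_t \\ 0, & \text{otherwise} \end{cases} $$

is the indicator function;

$$ l_t = \max \{\, j : N_j = 1,\ j \le t \,\} $$

is the last time (up to and including $t$) that a pulse was detected; and $a_t$ is given by an
expression that is not recoverable from the source scan, where the conditional expectation of the
estimation error is

$$ \sigma_t^2 = E(\tilde{\lambda}_{t+1|t}^2) = p^T \operatorname{diag}(p)\, \hat{x}_{t+1|t} - (\hat{\lambda}'_{t+1|t})^2 $$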


In addition to the estimation of the model parameters and the detection of the pitch epochs,
the speech analysis problem requires the estimation of the variance of the input noise process
$v_t$, and the tracking of slowly-varying model parameters. The pitch pulse is usually not of a
single sample time duration, but may persist for several sample times. Thus, the combined
estimation/detection estimator must be generalized to allow input noise variance estimation and
the estimation of composite pulses (i.e., pulses that persist for several time samples). The
generalized algorithm is (2.18) where $Y_t$ and $\hat{A}_t$ are defined as in (2.3).

$$ \hat{A}_t = \hat{A}_{t-1} + P_t Y_t \Big[\, y_t - \hat{A}_{t-1}^T Y_t - \sum_i \hat{b}_i(t-1)\, N_{t-i} \Big] \tag{2.18} $$

The matrix $P_t$ is given by

$$ P_t = \frac{1}{\alpha_1} \left[ P_{t-1} - \frac{P_{t-1} Y_t Y_t^T P_{t-1}}{\alpha_1 + Y_t^T P_{t-1} Y_t} \right] $$


The  unnormalized  intensity  estimate  of  the  composite  pulse  profile  is 


*;«)  =  (*i)  6.(o 


S,  «■  5,.,  + 


y. a'-  l  «» 


6.(0  «  6  j(f-l)  +  - It,  -  6  ,(<-l)J 



The parameters $\alpha_1$, $\alpha_2$, and $\alpha_3$ are the weighting factors for the model
coefficients, the energy in the deconvolved waveform, and the pulse intensity, respectively. The
process $N_t$ is given by (2.5). Fig. 2.1 illustrates the block diagram of the joint estimation and
excitation detection system.
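The coefficient update of (2.18) has the shape of an exponentially weighted recursive least squares step in which the estimated pulse contribution is subtracted from the innovation. The sketch below assumes the standard weighted RLS form for $P_t$ with weighting factor $\alpha_1$, and passes the pulse contribution in as a single number; the function name, the initial value of $P$, and the test signal are illustrative assumptions.

```python
import numpy as np

def rls_step(A, P, Y, y_t, pulse_term=0.0, alpha1=0.99):
    """One coefficient update: A_t = A_{t-1} + P_t Y_t [y_t - A_{t-1}^T Y_t - pulse_term].

    pulse_term stands for the estimated pulse contribution at time t,
    which the caller is assumed to supply from the detector output.
    """
    PY = P @ Y
    # Weighted RLS covariance update; P is symmetric, so outer(PY, PY) = P Y Y^T P.
    P_new = (P - np.outer(PY, PY) / (alpha1 + Y @ PY)) / alpha1
    innovation = y_t - A @ Y - pulse_term
    A_new = A + (P_new @ Y) * innovation
    return A_new, P_new

# Hypothetical usage: identify a second-order AR model driven by noise only.
rng = np.random.default_rng(1)
a_true = np.array([-1.2, 0.6])
y = np.zeros(2000)
for t in range(2, len(y)):
    y[t] = -a_true[0] * y[t - 1] - a_true[1] * y[t - 2] + rng.standard_normal()

A, P = np.zeros(2), 1e3 * np.eye(2)
for t in range(2, len(y)):
    Y = np.array([-y[t - 1], -y[t - 2]])   # regressor with the sign convention of (2.3)
    A, P = rls_step(A, P, Y, y[t], alpha1=1.0)
```

Choosing `alpha1` below one discounts old data and lets the estimate follow slowly varying coefficients, at the price of a noisier estimate.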


The operation of the system with these transition dynamics is essentially as follows. Once a
pulse is detected, the Markov chain is forced into its lowest state, $p_1$, thus raising the threshold
and reducing the probability of false alarms over the interval immediately following the detection
of a pulse. Once the time interval $a_t - l_t$ has elapsed, the Markov chain is restored to its state
at the time that the last pulse was detected. Fig. 2.2a illustrates the residual from the AR
approximation and the adaptive threshold used to detect the pulses. Note that the residuals are
clearly non-Gaussian. Fig. 2.2b shows the residuals after the detected pulses have been removed.
Fig. 2.2c shows the estimated pulse rate as defined by (2.12). The dashed line indicates the pulse
rate uncertainty as given by (2.17). The advantages of this procedure are: 1) the near-periodic
nature of the process may be explicitly modeled; and 2) the "period" of the pulse train is adjusted
adaptively, and the probability of false alarms decreases as $\sigma_t$ decreases.

The ability of this algorithm to track pitch period variation and detect unvoiced speech is
illustrated in Fig. 2.3. Fig. 2.3a is the beginning of the phrase "Thieves who rob ..". The
estimated pitch rate is shown in Fig. 2.3b; note the transition in pitch period and detection of
unvoiced regions. In this example the pitch pulse was assumed to consist of three successive time
samples. The estimated weighting coefficients are illustrated in Fig. 2.4.


2.4  CONCLUSIONS


A new approach to the pitch detection problem of speech analysis has been presented. This
solution provides a mechanism to account for the structure of the pitch period, and thereby allows
a reduction in pitch detection errors (false alarm rate). The key feature of this procedure is a new
decision-directed algorithm that incorporates a finite-state Markov chain model for the rate of the
process, and provides an exact, recursive nonlinear estimator for the rate. The algorithm allows
the estimation of time-varying model parameters and the variance of the input WGN process.

The algorithm has been applied to samples of actual speech, and promising results have
been obtained. It should be emphasized that much more work must be performed in order to
validate this algorithm in actual speech analysis, but these preliminary results appear encouraging.

A further description of this decision-directed method of estimating a joint noise and jump
process was presented in the first ONR project report, M736-1, Feb. 1983.
Figure 2.1  System Model Parameter Estimation/Input Pulse Detection Algorithm for Simple Pulses


3.  PITCH DETECTION BY LEAST SQUARES LATTICE


3.1  INTRODUCTION

A new method of pitch detection for speech has been developed that is based upon the
unnormalized pre-windowed least squares lattice algorithm. It is an extension of a previously
studied method [LM] that involved the forward residuals and the so-called likelihood variable. By
incorporating information from the forward residual covariance, well defined pitch pulse locations
are produced from which the period can easily be determined.

A well known pitch detection method (LPC-10) [NSA] using the average magnitude
difference function is discussed in Section 3.2. The unnormalized pre-windowed least squares
lattice algorithm, which is fundamental to our approach, is summarized in Section 3.3. The new
method of pitch detection and the pitch variable are presented in Section 3.4. Simulation results
using sampled speech and comparisons with LPC-10 are in Section 3.5.

An efficient speech representation that captures the basic patterns in speech is essential for
speech transmission at low bit rates or for speech recognition. The most popular parametric
speech model consists of a linear filter with time varying coefficients driven by a time varying
excitation process. The Linear Predictive Coding (LPC) [MG], [RS] model has an all pole filter
with regularly updated coefficients excited by either white noise or a periodic pulse sequence. The
filter represents the time varying nature of the vocal tract. The filter parameters determine the
spectral characteristics of the resulting sound for both types of excitation. The periodic pulses
generate voiced sounds such as vowels while unvoiced or hiss sounds are produced by the white
noise process. Thus, the important parameters of such a speech model are: (1) filter coefficients,
(2) voiced or unvoiced decision, (3) period of the pitch pulses (if voiced), and (4) signal energy.
Based on the above parametric speech model, Fig. 3.1 displays the corresponding speech
transmission system. The analysis component of the system determines the speech parameters
which are then encoded for transmission across the channel. At the receiver, a synthesis filter
characterized by the received coefficients is driven by the appropriate excitation process to
generate a waveform which hopefully sounds like the original speech.

The temporal information carried by the periodic pulses or the change from noise to pulses
is perceptually very important. Errors in the excitation cause severe distortion in the
synthesized speech. Errors in estimating the filter coefficients cause changes in the spectrum of
the sound, which tend to muffle the speech sound. Several techniques have been developed to
estimate the filter coefficients. Unfortunately, the periodic excitation component is the most
difficult to estimate. Our research activities in this area have been directed at better
determination of the occurrence of pitch pulses.

3.2  STANDARD  PITCH  ESTIMATION  TECHNIQUE 

The  pitch  detection  procedure  used  in  LPC-10,  the  National  Security  Agency  standard  for 
2400  bit  per  second  speech  transmission,  was  used  as  a  benchmark  for  pitch  period  estimates,  see 
Fig.  3.2.  The  transmitter  is  comprised  of  the  necessary  components  required  to  determine  the 
parameters  of  the  above  speech  model.  Note  that  the  reflection  coefficients  (RC),  energy  (RMS), 
voiced/unvoiced  (VUV)  decision,  and  pitch  for  a  segment  of  speech  are  encoded  for  transmission. 
A  speech  segment  is  typically  180  samples  (8000  Hz  sampling  rate). 

The pitch information is obtained by a series of operations on the speech waveform as
indicated by Fig. 3.3. First, the speech is filtered by a low pass Butterworth filter (800 Hz
bandwidth). This output is then whitened by a low order adaptive inverse filter to remove the
speech formants. The average magnitude difference function (AMDF) of the resulting waveform
is then computed as in (3.1) where $x_t$ is the low pass and inverse filtered speech and $L$ is the
length of the speech segment [RSCFM].

$$ F_\tau = \frac{1}{L} \sum_{t} |x_t - x_{t-\tau}|, \qquad \tau = -(L-1), \ldots, 0, \ldots, (L-1) \tag{3.1} $$

Deep nulls occur in $F_\tau$ at delays corresponding to the pitch period of a voiced sound having a
quasi-periodic structure. From this information, a pitch decision algorithm involving dynamic
programming determines the pitch period for the speech segment. The voiced/unvoiced decision
is made from a zero crossing analysis of the speech and the energy of the low pass filtered speech.
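The AMDF of (3.1) is simple to sketch. The version below is illustrative only: it evaluates non-negative lags, sums over the samples for which both terms exist within the segment, and replaces the dynamic programming decision stage by a plain search for the deepest null inside an assumed lag range. The function names and the test signal are assumptions, not part of LPC-10.

```python
import numpy as np

def amdf(x, max_lag):
    """Average magnitude difference function for lags 0..max_lag, normalized by L."""
    x = np.asarray(x, dtype=float)
    L = len(x)
    F = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        # only samples where both x[t] and x[t - tau] lie inside the segment
        F[tau] = np.sum(np.abs(x[tau:] - x[:L - tau])) / L
    return F

def pitch_period(x, min_lag, max_lag):
    """Lag of the deepest AMDF null within [min_lag, max_lag]."""
    F = amdf(x, max_lag)
    return min_lag + int(np.argmin(F[min_lag:]))

# Hypothetical segment: 180 samples of a sinusoid with a 40-sample period.
t = np.arange(180)
x = np.sin(2 * np.pi * t / 40)
period = pitch_period(x, 20, 60)
```

Restricting the search range matters because nulls also appear at multiples of the pitch period, which is one reason a more careful decision stage is used in practice.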
3.3  PRE-WINDOWED LEAST SQUARES LATTICE

Since the pitch detection scheme utilizes parameters from the Least Squares Lattice
estimation algorithm, this algorithm will be briefly introduced here. The "unnormalized"
pre-windowed least squares lattice algorithm [Lee] was first derived from the well known
multi-channel Levinson (LWR) algorithm for stationary processes. A more complete description
of recursive least squares estimation is presented in [T]. The LWR solution involves solving the
so-called normal equations, (3.2), recursively for the forward and backward predictor coefficients
$a_i$ and $b_i$.


The (ensemble) covariance matrix of the process is $R_p$, and $R_p^\epsilon$ and $R_p^r$ are the
forward and backward prediction error (i.e. residual) covariances. The forward and backward
residuals $\epsilon_{p,T}$ and $r_{p,T}$ are obtained from the predictor coefficients and the process
$y_T$.

$$ \epsilon_{p,T} = y_T + \sum_{i=1}^{p} a_i\, y_{T-i} $$

$$ r_{p,T} = y_{T-p} + \sum_{i=1}^{p} b_i\, y_{T-p+i} $$

In the derivation of the LWR algorithm, the mean square prediction error is minimized or
equivalently the following orthogonality property is satisfied at each order-update recursion ($E$
denotes expectation).

$$ E(\epsilon_{p,T}\, y_k) = 0, \qquad T-p \le k \le T-1 \tag{3.4} $$

When the desired filter order $N$ is obtained, the recursions terminate, resulting in only $O(N^2)$
computations compared to the $O(N^3)$ required to simply invert $R_N$.
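The order-update idea can be illustrated with the scalar Levinson-Durbin recursion, which is the single-channel special case of the LWR algorithm. The sketch below is an assumption-laden illustration rather than the multi-channel algorithm of the report: it takes a known autocorrelation sequence `r`, uses the sign convention of the forward residual above (predictor coefficients added to $y_T$), and names its variables freely.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the scalar normal equations by order recursion in O(order^2) operations.

    Returns the forward predictor coefficients a_1..a_order, the final
    prediction error variance, and the reflection coefficients.
    """
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                         # zeroth-order prediction error variance
    ks = []
    for p in range(1, order + 1):
        # correlation between the current forward residual and the new sample
        delta = r[p] + np.dot(a[1:p], r[p - 1:0:-1])
        k = -delta / err
        a_prev = a.copy()
        for i in range(1, p):
            a[i] = a_prev[i] + k * a_prev[p - i]   # order update of the predictor
        a[p] = k
        err *= (1.0 - k * k)           # error variance shrinks with each order
        ks.append(k)
    return a[1:], err, np.array(ks)

# Hypothetical autocorrelation values.
r = np.array([1.0, 0.5, 0.2, 0.05])
a, err, ks = levinson_durbin(r, 3)
```

Each pass reuses the previous-order solution, so no matrix is ever inverted, which is where the saving over direct inversion comes from.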

It can be shown that the LWR algorithm leads naturally to a lattice filter structure that
computes the forward and backward residuals. However, the reflection coefficients are fixed
(time-independent) since the recursions are strictly an order-update solution for a stationary
process with known second order statistics $R_p$. As a consequence, the LWR lattice solution is
incapable of tracking statistical variations.




Consequently, the pre-windowed lattice algorithm was developed to track nonstationary
processes without any knowledge of the underlying statistics. Because the statistics are assumed
unknown, the sum of squared prediction errors weighted by $\lambda$ is minimized instead.

$$ \sum_{t \le T} \lambda^{T-t}\, \epsilon_{p,t}^2 \tag{3.5} $$

The exponential forgetting factor, $\lambda$ ($0 < \lambda < 1$), permits more rapid tracking of
statistical variations in the process. The resulting solution extends the LWR solution to include
time-update recursions so that the reflection coefficients become time varying in general.

In order to introduce the time-update expressions, subscript $T$ has to be appended to the
coefficients to indicate that they are time-dependent. The forward and backward predictor
coefficients become $a_{i,T}$ and $b_{i,T}$. The sample covariance of the process, $R_{p,T}$, is
defined as in (3.6).

$$ R_{p,T} = Y_{p,T}\, Y_{p,T}^T \quad \text{where} \quad Y_{p,T} = \begin{bmatrix} y_0 & \cdots & y_p & \cdots & y_T \\ 0 & \ddots & \vdots & & \vdots \\ 0 & \cdots & y_0 & \cdots & y_{T-p} \end{bmatrix} \tag{3.6} $$

An auxiliary set of coefficients is necessary to facilitate the time-update expressions. The
particular quantity of interest is known as the likelihood variable, $\gamma_{p,T}$, and acts like an
adaptive weighting factor involving previous data.

$$ \gamma_{p,T} = [\, y_T, \ldots, y_{T-p} \,]\; R_{p,T}^{-1}\; [\, y_T, \ldots, y_{T-p} \,]^T \tag{3.7} $$

The resulting algorithm is denoted pre-windowed since the data matrix (3.6) assumes that data
prior to $y_0$ is exactly zero. Without going into the details of the derivation, we now discuss the
algorithm and the corresponding lattice structure of Fig. 3.4.

The  input/output  expressions  for  the  forward  and  backward  residuals  of  each  lattice  section 
use  Kf+ ,  T  and  K'+  i  r,  the  forward  and  backward  reflection  coefficients. 

es+i.r  ™  *,.T  “  Kf+i ,r  ro,T- i  (3-8) 

fr+i.r  “  to,t- i -  *»+i,rVr 

The lattice structure and (3.8) follow directly from the LWR solution except that in this case the
coefficients are time dependent (denoted by the subscript T). The order-update expressions for
the forward and backward residual covariances also follow from the Levinson solution.

$$R^e_{p+1,T} = R^e_{p,T} - K^r_{p+1,T}\, \Delta_{p+1,T} \qquad (3.9)$$

$$R^r_{p+1,T} = R^r_{p,T-1} - K^e_{p+1,T}\, \Delta_{p+1,T}$$

The sample partial correlation coefficient (PARCOR) is $\Delta_{p+1,T}$. When appropriately
normalized, the partial correlation coefficient becomes the reflection coefficient, which has the
desirable numerical feature of being bounded by $\pm 1$.

$$K^e_{p+1,T} = \Delta_{p+1,T}\, R^{-e}_{p,T} \qquad K^r_{p+1,T} = \Delta_{p+1,T}\, R^{-r}_{p,T-1} \qquad (3.10)$$

Here $R^{-e}_{p,T}$ and $R^{-r}_{p,T-1}$ are the matrix inverses of $R^e_{p,T}$ and $R^r_{p,T-1}$, respectively. The order-update
expressions for the covariances (3.9) are employed initially, while the time index is not greater
than the desired filter length N.

The  remaining  recursions  in  the  algorithm  involve  the  likelihood  variable  and  represent  the 
major  difference  between  the  LWR  solution  and  the  adaptive  lattice  solution.  The  PARCOR 
variable  can  also  be  time-updated. 

$$\Delta_{p+1,T} = \lambda\, \Delta_{p+1,T-1} + e_{p,T}\, r_{p,T-1} \,/\, (1 - \gamma_{p-1,T-1}) \qquad (3.11)$$

When  the  time  index  exceeds  the  filter  order,  the  covariances  are  time-updated  instead  as  in 
(3.12). 

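$$R^e_{p+1,T} = \lambda\, R^e_{p+1,T-1} + e^2_{p+1,T} \,/\, (1 - \gamma_{p,T-1}) \qquad (3.12)$$

$$R^r_{p+1,T} = \lambda\, R^r_{p+1,T-1} + r^2_{p+1,T} \,/\, (1 - \gamma_{p,T})$$

The likelihood variable is updated as in (3.13).

$$\gamma_{p,T} = \gamma_{p-1,T} + r^2_{p,T}\, R^{-r}_{p,T} \qquad (3.13)$$

It can be shown that the range of $\gamma_{p,T}$ is between zero and one.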

The complete set of order and time-update recursions of the unnormalized pre-windowed
adaptive lattice algorithm with exponential weighting is given by (3.8) to (3.13).
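
The recursions (3.8) to (3.13) can be sketched in a few lines of Python. This is a minimal illustration for the scalar case under stated assumptions, not the implementation used in this study: it assumes NumPy, applies the covariance time-updates (3.12) at every step instead of switching over from the order-updates (3.9), and initializes every covariance to a hypothetical a priori value `sigma`. The function name `ls_lattice` and its arguments are illustrative only.

```python
import numpy as np

def ls_lattice(y, order, lam=0.99, sigma=1.0):
    """Scalar unnormalized pre-windowed LS lattice, after (3.8)-(3.13).

    Returns e_N, gamma_N and R^e_N for every time step T.
    Simplification (assumption): covariances use the time-updates (3.12) throughout.
    """
    n = order
    r_prev = np.zeros(n + 1)           # r_{p,T-1}
    Re_prev = np.full(n + 1, sigma)    # R^e_{p,T-1}
    Rr_prev = np.full(n + 1, sigma)    # R^r_{p,T-1}
    delta = np.zeros(n + 1)            # Delta_{p,T-1}; entries 1..N are used
    g_prev = np.zeros(n + 2)           # g_prev[p + 1] = gamma_{p,T-1}; g_prev[0] = gamma_{-1} = 0

    e_out = np.empty(len(y))
    g_out = np.empty(len(y))
    Re_out = np.empty(len(y))

    for T, yT in enumerate(y):
        e = np.empty(n + 1)
        r = np.empty(n + 1)
        Re = np.empty(n + 1)
        Rr = np.empty(n + 1)
        g = np.zeros(n + 2)

        # Order 0: both residuals equal the new sample.
        e[0] = r[0] = yT
        Re[0] = lam * Re_prev[0] + yT * yT
        Rr[0] = lam * Rr_prev[0] + yT * yT
        g[1] = r[0] ** 2 / Rr[0]                      # (3.13) with gamma_{-1} = 0

        for p in range(n):
            # (3.11) time-update of the PARCOR variable
            delta[p + 1] = lam * delta[p + 1] + e[p] * r_prev[p] / (1.0 - g_prev[p])
            # (3.10) forward and backward reflection coefficients
            Ke = delta[p + 1] / Re[p]
            Kr = delta[p + 1] / Rr_prev[p]
            # (3.8) order-updates of the residuals
            e[p + 1] = e[p] - Kr * r_prev[p]
            r[p + 1] = r_prev[p] - Ke * e[p]
            # (3.12) time-updates of the residual covariances
            Re[p + 1] = lam * Re_prev[p + 1] + e[p + 1] ** 2 / (1.0 - g_prev[p + 1])
            Rr[p + 1] = lam * Rr_prev[p + 1] + r[p + 1] ** 2 / (1.0 - g[p + 1])
            # (3.13) order-update of the likelihood variable
            g[p + 2] = g[p + 1] + r[p + 1] ** 2 / Rr[p + 1]

        e_out[T], g_out[T], Re_out[T] = e[n], g[n + 1], Re[n]
        r_prev, Re_prev, Rr_prev, g_prev = r, Re, Rr, g

    return e_out, g_out, Re_out
```

With a positive `sigma`, every covariance stays positive and the likelihood variable stays below one, so none of the divisions above can fail for finite input data.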

3.4 PITCH DETECTION BASED ON LEAST SQUARES LATTICE

The method of pitch detection is an extension of previous results [LM] obtained with the
unnormalized pre-windowed lattice algorithm of Section 3.3. This previously studied scheme
utilized information contained in the forward residuals and the likelihood variable to determine pitch
pulse locations in the speech waveform. The results were promising since well defined pitch pulses
could be identified. However, in addition to these desired pulses, spurious less dominant ones
were also present. Removing these from the waveform required many heuristic factors and met
with limited success.

The new method of pitch detection enhances the previous results by employing the forward
residual covariance. Consequently, more clearly defined pitch pulses can be obtained, so that fewer
heuristic factors are required to identify the desired pulse locations. The significance of each of the
lattice variables used in the pitch estimation process (forward residuals, likelihood variable, and
forward residual covariance) is discussed below.


Forward  Residuals 


Consider a data sequence $y_k$ where the time index k ranges from a finite time in the past
(denoted zero) to the present time T. The $p$-th order forward residual $e_{p,T}$ is then defined as the
difference between the actual value $y_T$ and a linear least squares estimate $\hat{y}_{T|T-1,T-p}$ that
involves only the p previous data samples $(y_{T-p}, \ldots, y_{T-1})$.

$$e_{p,T} = y_T - \hat{y}_{T|T-1,T-p} \qquad (3.14)$$

This estimate results from the projection of $y_T$ on the space spanned by the p previous
measurements. The coefficients for a linear predictor are $a_{p,T,k}$.

$$\hat{y}_{T|T-1,T-p} = -\sum_{k=1}^{p} a_{p,T,k}\, y_{T-k} \qquad (3.15)$$

Now, $e_{p,T}$ represents the new information in $y_T$ that is not present in the p previous
measurements. As a result, it can provide information concerning waveform changes that may not be
as obvious in the original process. It is precisely this feature of the residuals that is important for
pitch detection. If one observes a voiced segment of speech, the quasi-periodic structure is easily
seen. However, it is difficult to consistently identify waveform locations from which to reliably
extract the pitch period. This occurs since there is a high degree of correlation between speech
samples [RSCFM].

Since the residuals are a whitened form of the speech process, they provide more clearly
defined events from which to identify the pitch period. There is much other information
contained in the residuals, extraneous to pitch detection, that must be removed or masked. This
function is provided by the likelihood variable and the forward residual covariance. Since $e_{p,T}$ is
not truly a whitened process (i.e., the innovations), and since as much decorrelation as possible is
required, only the highest order residual $e_{N,T}$ is considered. The true innovations involve all past
data, $y_0, \ldots, y_{T-1}$.

Likelihood  Variable 

The definition of the likelihood variable $\gamma_{p,T}$ from (3.7) in terms of the sample covariance
$R_{p,T}$ is (3.16).

$$\gamma_{p,T} = Y_{T:T-p}^T\, R_{p,T}^{-1}\, Y_{T:T-p} \quad \text{where} \quad Y_{T:T-p} = [\, y_T, \ldots, y_{T-p} \,]^T \qquad (3.16)$$

For a (zero mean) Gaussian process, the $p$-th order likelihood function is $p(Y_{T:T-p})$, where $R_p$ is
the ensemble covariance of the process.

$$p(Y_{T:T-p}) = (2\pi)^{-p/2}\, |R_p|^{-1/2} \exp\!\left( -\tfrac{1}{2}\, Y_{T:T-p}^T\, R_p^{-1}\, Y_{T:T-p} \right) \qquad (3.17)$$

Thus $\gamma_{p,T}$, called the likelihood variable, is an estimate of the exponent of the likelihood function.
Although not obvious from this result, it has been shown (by simulation) that $\gamma_{p,T}$ is a good
indicator of deviations from a Gaussian distribution [LM, ML]. This is of course desirable for pitch
detection since the speech model consists of a Gaussian component for unvoiced segments and a
non-Gaussian quasi-periodic component for voiced segments. Thus, sudden changes in $\gamma_{p,T}$
should indicate the onset of voiced segments in speech. In fact, simulation results show that $\gamma_{p,T}$
does change significantly for voiced speech segments.

The  likelihood  variable  detects  general  statistical  deviations  (see  Section  3.5).  Consequently 
other  speech  characteristics  such  as  plosives,  which  are  not  quasi-periodic,  are  also  detected  by 

Nevertheless, promising results have been obtained by multiplying together the forward
residual signal and the derivative of $\gamma_{p,T}$. Simulations have shown that much of the extraneous
information contained in the forward residuals is removed, exposing well defined pulse locations
from which to identify the pitch period. However, as mentioned before, spurious pulses generally
remained, which were then removed by a combination of thresholding and an exponentially
decaying function that basically extracts the largest peaks over the waveform. The need for these
heuristic methods is reduced by the enhanced pitch detection method: the new method, which
utilizes the forward residual covariance, further reduces the occurrence of these spurious pulses
without thresholding.

The role of the forward residuals is important since $\gamma_{p,T}$ corresponds precisely to their
normalized sum (squared).


$$\gamma_{p,T} = \sum_{k} e_{k,T}^{2} \qquad (3.18)$$


Thus $\gamma_{p,T}$ contains information from p measurements, see (3.16). For the same reason that $e_{N,T}$
is used, only the highest order quantity $\gamma_{N,T}$ is used for pitch detection; simulations have shown
that $\gamma_{N,T}$ produces better results (than lower orders), namely well defined pitch pulse locations.


Forward  Residual  Covariance 

Recall that the time-update expression for the forward residual covariance is (3.19).

$$R^e_{p+1,T} = \lambda\, R^e_{p+1,T-1} + e^2_{p+1,T} \,/\, (1 - \gamma_{p,T-1}) \qquad (3.19)$$

It is essentially the sum of the present and all previous (exponentially weighted) forward residuals
squared. The effect of the initial order-update recursion (3.9) becomes negligible, especially with
the exponential "forgetting" factor $\lambda$, and can therefore be ignored.
Simulations indicate that the effect of $\gamma_{p,T-1}$ on $R^e_{p+1,T}$ is small and can also be ignored.

$$R^e_{p+1,T} \approx \sum_{k=0}^{T} \lambda^{T-k}\, e^2_{p+1,k} \qquad (3.20)$$
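
To see where (3.20) comes from, the recursion (3.19) can be unrolled back to an initial value $R^e_{p+1,-1}$. This is only a sketch of the intermediate step, assuming the time-update is applied from T = 0; the approximation in the last line rests on the two simplifications stated in the text (a negligible initial condition and $1 - \gamma_{p,k-1} \approx 1$).

```latex
\begin{align*}
R^e_{p+1,T} &= \lambda R^e_{p+1,T-1} + \frac{e_{p+1,T}^2}{1-\gamma_{p,T-1}} \\
            &= \lambda^{T+1} R^e_{p+1,-1}
               + \sum_{k=0}^{T} \lambda^{T-k}\,\frac{e_{p+1,k}^2}{1-\gamma_{p,k-1}} \\
            &\approx \sum_{k=0}^{T} \lambda^{T-k}\, e_{p+1,k}^2
\end{align*}
```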

As a consequence of this lengthy memory, the covariance does not change significantly except for
large (magnitude) increases in the residuals. This may occur, for example, when the variance of
the underlying process increases, as in the case of voiced segments of speech. Furthermore, the
covariance does not change much for decreases in the residuals. The degree of change is affected
directly by the value of $\lambda$, the memory factor in the algorithm.

This is a desirable feature, not shared by the likelihood variable, since the covariance can
detect a specific event of the waveform. Thus it becomes possible to consistently track recurring
large-magnitude increases in the speech waveform. By further masking (multiplying time signals
together) the forward residuals with the derivative of the covariance, a single event in each period
of the voiced segment can then be emphasized and therefore be more easily detected. In fact,
simulation results show that employing the covariance does significantly enhance the pitch pulse
locations. Consequently, the need for thresholding is reduced and windowing can be used instead
of an exponentially decaying function.

Simulation results also show that the highest order covariance $R^e_{N,T}$ provides better results
than lower orders; this appears to be related to the reduced correlation of the forward residuals
$e_{N,T}$.

Method  of  Pitch  Detection 

The fundamental concepts underlying the new method of pitch detection have now been
discussed. Those concepts can be combined into a single pitch detection variable. Recall that the
likelihood variable detects changes in the process statistics. Consequently, its derivative (i.e., its
first order time difference) indicates the intensity of those changes.

$$\delta\gamma_{N,T} = \gamma_{N,T} - \gamma_{N,T-1} \qquad (3.21)$$

If the forward residuals are multiplied by (3.21), then statistical changes in the process can be
emphasized. However, since (3.21) detects more events than are required for pitch detection, the
residuals are also multiplied by the derivative of the forward residual covariance.

$$\delta R^e_{N,T} = R^e_{N,T} - R^e_{N,T-1} \qquad (3.22)$$

This will then emphasize only those statistical changes that also include an increase in variance.
Thus the complete pitch detection variable, denoted $\eta_{N,T}$, is (3.23).

$$\eta_{N,T} = \mathrm{POS}\left( e_{N,T}\; \delta\gamma_{N,T}\; \delta R^e_{N,T} \right) \qquad (3.23)$$

where POS simply retains positive results of the quantity in parentheses.
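
The combination in (3.21) to (3.23) can be sketched as follows. This is an illustrative fragment, not the code used for the simulations: it assumes NumPy and that the three lattice outputs are already available as equal-length arrays; the function name and the treatment of the first sample (difference set to zero) are assumptions.

```python
import numpy as np

def pitch_detection_variable(e, gamma, Re):
    """eta_T = POS(e_T * dgamma_T * dRe_T), after (3.21)-(3.23)."""
    e = np.asarray(e, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    Re = np.asarray(Re, dtype=float)
    # First-order time differences; the first sample has no predecessor,
    # so its difference is set to zero (an assumption of this sketch).
    d_gamma = np.diff(gamma, prepend=gamma[0])
    d_Re = np.diff(Re, prepend=Re[0])
    # POS keeps positive values of the product and zeroes the rest.
    return np.maximum(e * d_gamma * d_Re, 0.0)
```

For example, `pitch_detection_variable([1, 2, -1, 3], [0.1, 0.3, 0.2, 0.6], [1, 2, 1.5, 4])` is nonzero only at the second and fourth samples, where the residual, the likelihood difference, and the covariance difference are all positive.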

Equation (3.23) clearly indicates a specific event in each period of a voiced speech
waveform, see Section 3.5. For unvoiced speech, pitch pulses are not produced by (3.23), so that
the need for separate voiced/unvoiced decision logic is eliminated. A summary of the (scalar
case) pre-windowed lattice algorithm with (3.20) to (3.23) incorporated follows on the next page.

SUMMARY: PITCH DETECTION VARIABLE
UNNORMALIZED PRE-WINDOWED LATTICE ALGORITHM

Initialization:

$R^e_{0,-1} = \sigma$ (a priori estimate), N = filter order

For each observation, $T \ge 0$:

3.5 SPEECH DATA RESULTS

Some simulation results obtained with the new pitch detection approach using the variable
$\eta_{N,T}$ are presented. The following phonetically balanced sentences, developed by the Defense
Advanced Research Projects Agency (DARPA), were studied.

File 1: 'cats and dogs each hate the other'; male speaker

File 2: 'the pipe began to rust while new'; female speaker

Both sentences were sampled at 8000 Hz with 16-bit integer quantization. The analysis lattice
employed in the simulations had the following parameter specifications: $R^e_{0,-1} = \sigma = 100{,}000$,
$\lambda = 0.99$, and N = 10. This value of $\lambda$ corresponds essentially to a window length of 100 samples,
which greatly exceeds the filter length N used.
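
The quoted window length is consistent with the usual effective-memory argument for exponential weighting. This is a standard rule of thumb, offered here as an assumption about how the figure of 100 samples was obtained: the weights $\lambda^k$ applied to past squared errors sum to

```latex
\sum_{k=0}^{\infty} \lambda^{k} = \frac{1}{1-\lambda} = \frac{1}{1-0.99} = 100 \ \text{samples}
```

At the 8000 Hz sampling rate this corresponds to 12.5 ms of speech.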

File  1 

The speech waveform of File 1 is shown in Fig. 3.5. The voiced segments are clearly visible
as those areas of (relatively) large magnitude. The first 2000 samples of this sentence, which
correspond to the word 'cats', are examined in detail. The consonant 'c' is visible beginning at
about sample 300 while the onset of the vowel 'a' occurs near sample 700, see Fig. 3.6. The
unvoiced letters 'ts' are not visible in this plot. The segment of interest for pitch detection is the
vowel /a/ since it corresponds to a quasi-periodic voiced segment of speech. The goal is to
extract the pitch information from this segment. The variables used in $\eta_{N,T}$ are shown separately,
followed by the full pitch estimate.

The ten reflection coefficients $K_{N,T}$ are shown in Fig. 3.7. This combined reflection
coefficient $K_{N,T}$ corresponds to SIGN($K^e_{N,T} K^r_{N,T}$), where SIGN simply applies the sign of $K^e_{N,T}$
(which is the same as that of $K^r_{N,T}$, see (3.10)) to the product in parentheses. A sudden change occurs in
all coefficients at the location of 'c', which is due to a change in the likelihood variable $\gamma_{N,T}$
(caused by a change in the process statistics), whose influence on $K_{N,T}$ is through (3.10)-(3.12).
The periodic structure of the coefficient waveforms is caused precisely by the periodic nature of
the voiced segment 'a'. In fact, this periodicity appears in all lattice variables, which is not
surprising since they each contain some combination of the forward residuals.

The forward residual $e_{N,T}$ is shown in Fig. 3.8 and its covariance $R^e_{N,T}$ in Fig. 3.9. Both of
these variables appear in the pitch detection variable $\eta_{N,T}$. We observe that $e_{N,T}$ does
correspond to a partially "whitened" version of the speech waveform of Fig. 3.6. Certain events
are emphasized more than others so that the periodic structure is well defined. It is this result
that permits clearly defined pitch pulses to be exposed when $e_{N,T}$ is appropriately masked by
$\delta\gamma_{N,T}$ and $\delta R^e_{N,T}$. In the covariance waveform, the periodic structure consists of very abrupt
increases and exponentially decaying decreases (due to $\lambda$). In addition, 'c' produces very little
change in the waveform, which is desirable for pitch detection. Both of these results are of course
due to variance changes occurring in the original speech waveform.

The likelihood variable $\gamma_{N,T}$ waveform is displayed in Fig. 3.10. It is seen that both 'c' and
'a' significantly affect $\gamma_{N,T}$ or, in other words, $\gamma_{N,T}$ detects (equally well) both types of statistical
variations: the onset of the unvoiced plosive 'c' and the voiced vowel 'a'. This is desirable for
pitch detection but, as mentioned previously in Section 3.4, there is in a sense more information
than necessary.

Next, $\delta\gamma_{N,T}$ and $\delta R^e_{N,T}$ are presented in Figs. 3.11 and 3.12, respectively. As expected from
Fig. 3.9, the dominant pulses of $\delta R^e_{N,T}$ are positive and little emphasis is placed on 'c'; such is not
the case with $\delta\gamma_{N,T}$. However, when the forward residuals are masked by these quantities, well
defined pitch pulse locations are obtained with relatively few spurious pulses, as indicated in Figs.
3.13 and 3.14. Fig. 3.15 shows further improvement when $\delta\gamma_{N,T}$ masks $\delta R^e_{N,T}$, but even better
results are obtained when both $\delta\gamma_{N,T}$ and $\delta R^e_{N,T}$ mask the forward residual $e_{N,T}$, as shown in Fig.
3.16.

A more detailed look at $\eta_{N,T}$ for samples 1000 - 1400 shows the quasi-periodic structure of
'a', Fig. 3.17. Note that the less dominant pulses in any period of $\eta_{N,T}$ (if they exist) tend to
cluster about the desired dominant pitch pulse locations. Hence the need for thresholding is
reduced, since windowing can be used to extract a pitch pulse location centered near the cluster.


For comparison with pitch results obtained with LPC-10, Fig. 3.18 displays a portion of $\eta_{N,T}$
(samples 800 - 1800). The upper row of numbers corresponds to the pitch periods obtained with
the new method and the lower row contains those determined by the LPC-10 algorithm; the
dotted lines indicate the boundaries of the 180 sample frames for which the LPC-10 pitch periods
were obtained. The new method using $\eta_{N,T}$ provides results comparable to those of the NSA
standard.

The words 'and dogs' (samples 3000 - 5000), from the same sentence (File 1), shown in Fig.
3.19 were also analyzed. The onset of 'a' is visible at about sample 3200 with 'o' beginning at
about 4600; the highly sinusoidal structure of the nasal 'n' ranges from 3500 to 4500 and the two
consonants 'd' actually occur as one at about sample 4500 ('gs' is not visible in this plot). Here
$\eta_{N,T}$ does not produce (significant) pitch pulses for much of the highly sinusoidal structure of
the nasal 'n', see Fig. 3.20. However, by examining more closely the range 3600 - 4400 and by
changing scales, pitch pulse locations are indeed present, Fig. 3.21. Thus the new pitch detection
method produces an increased dynamic range, which is a direct consequence of the product
$e_{N,T}\, \delta\gamma_{N,T}\, \delta R^e_{N,T}$. This result is in a sense a trade-off required to obtain such well-defined pitch
pulses. Nevertheless, this effect is not a problem since pitch pulse locations can be determined on
a local basis by windowing (e.g. 50 - 200 samples), so that the range of $\eta_{N,T}$ over a window length
is relatively small.
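
The local windowing idea can be sketched as below. This is a hypothetical, simplified picker and not the procedure used to produce the figures: it assumes NumPy, uses fixed non-overlapping windows of `window` samples (the text suggests 50 - 200), and keeps the largest positive value of the pitch detection variable in each window. The function name and parameters are illustrative.

```python
import numpy as np

def pick_pitch_pulses(eta, window=100):
    """Return the index of the largest positive value of eta in each window."""
    eta = np.asarray(eta, dtype=float)
    pulses = []
    for start in range(0, len(eta), window):
        segment = eta[start:start + window]
        k = int(np.argmax(segment))
        if segment[k] > 0.0:          # windows with no positive value yield no pulse
            pulses.append(start + k)
    return np.array(pulses, dtype=int)
```

Because each decision is made within one window, only the relative size of the pulses inside that window matters, which is why a large dynamic range across the utterance is harmless. Pitch periods then follow as `np.diff(pulses)`. A fixed window grid can split a cluster of pulses across two windows, so a practical version would probably centre each window on the previously detected pulse.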

File 2

The pitch period of female speakers is typically shorter than that of male speakers, so that it is
generally more difficult to determine consistently. From the second sentence, the onset of the
word 'while' is displayed in Fig. 3.22. The consonants 'wh' are barely noticeable, so that the onset
of the vowel 'i' occurs almost immediately at sample 1700. Good results are obtained with $\eta_{N,T}$,
Figs. 3.23 and 3.24. The results concur with those of LPC-10, see Fig. 3.25.


3.6  CONCLUSIONS 


As a consequence of the lattice filter algorithm, the information needed to compute the
pitch period is available at each time instant. On-line pitch detection is therefore possible and,
moreover, additional parallel processing is not needed to determine pitch pulse locations. That is,
the recursions required to compute the reflection coefficients that characterize the parametric
speech model also compute simultaneously the pitch variable. Furthermore, the voiced/unvoiced
decision is inherent in the masking technique; either a pitch pulse is present (voiced) or it is not
(unvoiced).

As  an  extension  of  previous  results  using  the  likelihood  variable,  the  new  method  minimizes 
the  need  for  thresholding  since  more  distinct  pitch  pulses  are  generated.  As  a  consequence  of  this, 
the  exponentially  decaying  function  used  to  determine  the  period  can  be  replaced  by  a  simpler 
windowing  technique. 

4. RESEARCH ON RECOGNITION OF STOP CONSONANTS

4.1  INTRODUCTION 

Current speech recognition techniques can accurately determine the vowels within a
particular word since vowels are of relatively long duration and change character slowly. Within the
class of consonants, the stop consonants are hard to distinguish due to their short duration and
transient nature. In order to recognize these transitional sounds, an estimation technique that can
track the changes is necessary. The approach presented here utilizes the recursive exact least
squares lattice estimation algorithm to determine an autoregressive model (hence a spectral
representation) of the speech. This recursive algorithm updates its representation at every speech
sample using exponentially weighted past data. Thus it is possible to track the spectral changes
in the speech without much time smearing. The experiments performed here on natural speech
data were motivated by an attempt to better characterize the fast transitions that occur in stop
consonants. A representation based on trajectories of appropriate speech parameters was
developed and analyzed.

The region of first and second formant (spectral peak) where each vowel typically occurs
was determined by Peterson and Barney [PB]. Diphthongs follow a trajectory within a known
region between the vowels. The current understanding of speech perception has not clearly
identified whether spectral characteristics are sufficient to distinguish transient consonants such as
stops. If the parameterization for stops were dependent on the following (preceding) vowel, the
task of automatic speech recognition would be more difficult. The results of Stevens and
Blumstein [BS] indicate that the place of articulation for stop consonants is cued by spectral properties
in the 10 to 20 millisecond period initiated by the burst onset. Their studies indicate that the
spectral properties of this short time interval appear to be invariant with respect to the following vowel. This
burst onset of the stops lasts less than 160 speech samples at an 8 kHz sampling rate. Once the
formant transition has started, the transition is dependent on the following vowel but can be used
as a context dependent cue for determination of the consonant, if the voicing condition is already
known. Fast estimation techniques are necessary to determine the speech spectra over such a
short time interval.

The fast recursive exact least squares lattice algorithm developed by Morf et al. [LM]
estimates the signal spectrum by fitting an autoregressive model. Determining a new estimate for
every speech sample using an exponential weighting of past data allows the estimates to keep up
with the short time signal characteristics. Section 4.2 describes the recursive lattice estimation
algorithm. This algorithm was applied to natural speech words from a set of Diagnostic Rhyme
Test words. The voiced stops /b/, /d/ and /g/ followed by various vowels, spoken by a single
male speaker, were examined in detail.

The  process  of  clustering  observations  should  be  insensitive  to  a  transformation  of  variables 
provided  the  distance  metric  is  appropriately  changed.  Thus  a  clustering  in  the  space  of  reflection 
coefficients  with  a  suitable  metric  is  equivalent  to  frequency  domain  clustering.  The  technique 
called  Vector  Quantization  (VQ)  was  used  in  this  study  to  perform  the  clustering  of  parameters. 
The  standard  VQ  algorithm  is  presented  in  Section  4.3.  Experiments  applying  the  technique  of 
vector quantization, appropriately modified, have determined a suitable parameterization for
distinguishing the stop consonants. These modifications to the standard VQ technique are discussed
in  Sections  4.6  and  4.7. 

Section  4.4  discusses  the  results  of  applying  the  standard  VQ  method  to  consonant-vowel 
words.  Section  4.5  looks  at  the  differences  in  the  same  vowel  spoken  in  different  words.  Section 
4.6  introduces  an  augmented  parameterization  that  includes  information  about  reflection 
coefficient  trajectories  that  can  assist  in  classifying  stop  consonants.  Section  4.7  presents  a  new 
Classified Vector Quantization method and its application to consonant-vowel words. Section 4.8
summarizes the results of our procedure to recognize the voiced stops /b/, /d/, /g/. A summary
and  discussion  of  future  research  is  in  Section  4.9. 


4.2  RECURSIVE  LATTICE  ESTIMATION  ALGORITHM 


An alternative parameterization of an autoregressive model is in terms of reflection
coefficients $\{\rho_i\}$ in the lattice filter structure. The lattice structure can be related to the transfer
function of an acoustical tube formed from connected cylinders of differing diameters. The
propagation of acoustic waves down the tube experiences reflections and transmissions at each
discontinuity. The reflection coefficients of the lattice filter structure can be related to the signal
propagation across a discontinuity in the acoustic tube model. Furthermore, the reflection coefficients
can be interpreted as correlation coefficients between the signals in the two paths of the lattice
structure. Thus the process of estimating reflection coefficients is similar to orthogonalizing the
observed signal with respect to its delayed version. This is one reason why spectral estimation by
reflection coefficients has been shown to adapt quickly. These techniques have been used
successfully in speech analysis and synthesis, fast adaptive equalization and spectral estimation.
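
The equivalence between the two parameterizations can be illustrated with the standard step-up (Levinson) recursion, which converts a set of reflection coefficients into the coefficients of the prediction polynomial. This is a textbook sketch assuming NumPy; the function name is illustrative, and the sign convention (predictor polynomial $A(z) = 1 + a_1 z^{-1} + \cdots + a_p z^{-p}$ with a plus sign in the update) is an assumption, since conventions differ between authors.

```python
import numpy as np

def reflection_to_predictor(k):
    """Step-up recursion: reflection coefficients -> [1, a_1, ..., a_p]."""
    a = np.array([1.0])
    for km in k:
        a_ext = np.append(a, 0.0)        # raise the order by one
        a = a_ext + km * a_ext[::-1]     # a_i <- a_i + k_m * a_{m-i}
    return a

# Reflection coefficients bounded by 1 in magnitude give a stable model.
a = reflection_to_predictor([0.5, 0.2])
```

Here `a` is `[1.0, 0.6, 0.2]`, and both roots of the corresponding polynomial lie inside the unit circle.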

Recently developed techniques by Morf et al. [LM] recursively update reflection coefficient
estimates as new data samples are observed, with exponential decay of past data. This algorithm
solves for the exact least squares fit to the observed data. The square root normalized algorithm,
(4.1), has a very compact notation and normalizes all signals to unit variance at each stage. The
response of this algorithm to synthetic signals with time varying characteristics and to speech
phrases was first studied in [ML].


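$$\rho_{p+1,T} = \sqrt{1 - \nu_{p,T}^2}\; \sqrt{1 - \eta_{p,T-1}^2}\; \rho_{p+1,T-1} + \nu_{p,T}\, \eta_{p,T-1}$$

$$\nu_{p+1,T} = \frac{\nu_{p,T} - \rho_{p+1,T}\, \eta_{p,T-1}}{\sqrt{1 - \rho_{p+1,T}^2}\; \sqrt{1 - \eta_{p,T-1}^2}} \qquad (4.1)$$

$$\eta_{p+1,T} = \frac{\eta_{p,T-1} - \rho_{p+1,T}\, \nu_{p,T}}{\sqrt{1 - \rho_{p+1,T}^2}\; \sqrt{1 - \nu_{p,T}^2}}$$
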

The tracking ability of the algorithm can be seen from the first four reflection coefficients
computed from the first 40 ms of 'did' and 'bid', see Fig. 4.1. The burst of the /d/ or /b/ and
the transition to the steady vowel /I/ is seen in the time waveform. The pitch pulses cause
momentary fluctuations in the coefficient values. The initial trajectories of the reflection
coefficients are seen to be different, particularly at higher than first order. Yet they all converge
to similar values after the onset burst as the vowel sound stabilizes. Note that the vowel
oscillation commences at about the same time in both words. The rise of the first reflection coefficient
is different during the burst onset. The second reflection coefficient is more steady in 'b' and
changes suddenly at the beginning of voicing oscillation. The third and fourth coefficients have
different values for the different stop consonants. Certain similarities were noted in the reflection
coefficient trajectories for all the trial words starting with 'b', and likewise for 'd'. The reflection
coefficients determine a spectral representation, so the formants (spectral peaks) can be estimated.
The second formant illustrates a rising trend for 'b' with less of a change for 'd'. The acoustic
models for the stop consonants differentiate each by the slope of the second formant.
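
One common way to estimate spectral peaks from an autoregressive model is to take the angles of the complex poles of $1/A(z)$. The text does not say how the formants were obtained here, so the following is only a sketch of that generic approach, assuming NumPy, a predictor polynomial given as `[1, a_1, ..., a_p]`, and the 8 kHz sampling rate used in this study; the function name is illustrative.

```python
import numpy as np

def formant_frequencies(a, fs=8000.0):
    """Estimate spectral peak frequencies (Hz) from A(z) = 1 + a_1 z^-1 + ... + a_p z^-p."""
    poles = np.roots(a)                    # zeros of A(z) are the poles of 1/A(z)
    poles = poles[np.imag(poles) > 0]      # keep one pole of each conjugate pair
    freqs = np.angle(poles) * fs / (2.0 * np.pi)
    return np.sort(freqs)

# A single resonance at 1000 Hz with pole radius 0.95.
radius, theta = 0.95, 2.0 * np.pi * 1000.0 / 8000.0
peaks = formant_frequencies([1.0, -2.0 * radius * np.cos(theta), radius ** 2])
```

Tracking the second entry of `formant_frequencies` over time would give the kind of second-formant trajectory discussed above, although poles with small radius do not always correspond to perceptible formants.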


4.3  VECTOR  QUANTIZATION 


Vector quantizers have been used for waveform and voice coding systems. Our application
of the vector quantization technique is to perform a clustering of speech sounds into categories
that can be identified with vowels and consonants. First the general framework of Vector
Quantization is presented. A vector quantizer maps input vectors drawn from the M-dimensional
Euclidean space $R^M$ into a finite set (codebook) of reproduction vectors (codewords) contained in
the space $R^K$. A vector quantizer (VQ) is described by the input vector dimension (M), the
reproduction vector dimension (K), the number of reproduction vectors (N), the set of
reproduction vectors $C = \{\hat{x}_i,\ i = 1, 2, \ldots, N\}$, and the mapping of the input space into the set of
reproduction vectors $q(x)$. In our studies the reproduction vector is of the same dimension as the input
vector, M = K.

A VQ used to compress speech for transmission requires two functional blocks: an
encoder, which views the input vector x and generates the index of the reproduction vector
specified by $q(x)$; and a decoder, which uses this index to generate the reproduction vector $\hat{x}_i$. A
VQ can be used to communicate over a digital channel by placing the encoder at the transmitter
and the decoder at the receiver and sending the index of the codeword across the channel. For
speech compression, each input LPC vector is mapped into a codeword of $\log_2 N$ bits per vector.
The bit rate is $\log_2 N$ bits times the rate of generation of LPC vectors. As the bits per vector
increase, the codebook size grows exponentially, requiring a similar increase in computational
effort and storage at both the encoder and decoder. The decoder stores the codebook and
performs the simple task of looking up the reproduction vector indexed by the encoder. The encoder
has the more complicated task of partitioning the input space into a collection of bins according
to $q(x)$, one bin for each reproduction vector in the codebook, and determining in which bin an
input vector is contained.
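
As a worked example of the bit-rate statement above (the numbers are illustrative assumptions, not figures from this study): a codebook of N = 1024 codewords, with one LPC vector generated per 180-sample frame at 8 kHz, gives

```latex
\log_2 N = \log_2 1024 = 10 \ \text{bits/vector}, \qquad
\frac{8000}{180} \approx 44.4 \ \text{vectors/s}, \qquad
10 \times 44.4 \approx 444 \ \text{bits/s}
```

Adding one bit per vector doubles the codebook to 2048 codewords, which is the exponential growth in storage and search effort noted above.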

If we define a distortion measure $d(x; \hat{x})$ which represents the penalty or cost associated with
reproducing a vector x by $\hat{x}$, then the best mapping $q(x)$ is the one which selects as the
reproduction vector for x the codeword $\hat{x}_i$ that minimizes $d(x; \hat{x}_i)$. With such a minimum distortion or
nearest neighbor mapping, the encoder operates by computing $d(x; \hat{x}_i)$ for $i = 1, 2, \ldots, N$, and
then selecting the value of i (by a full search) for which $d(x; \hat{x}_i)$ is minimized. This implies that
the bin associated with a particular codeword $\hat{x}_i$ is the set of input vectors for which $\hat{x}_i$ is the
minimum distortion codeword.

Vector quantization applied to LPC voice coders is used to encode and decode the autoregressive
model generated by an LPC analysis of a speech frame. (The coding of the excitation parameters is
not considered here.) The LPC speech model is shown in (4.2).

$$\sigma / (1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_p z^{-p}) = \sigma / A(z) \qquad (4.2)$$

The order $p$ used here is 10. Once the model parameters $\{\sigma, a_1, a_2, \ldots, a_p\}$ have
been obtained, they are coded by means of vector quantization. The input vector $x$ to the VQ is
the vector $(\sigma\ a_1\ a_2\ \cdots\ a_p)^T$ of model parameters. Each codeword is a vector
$(\hat{\sigma}_i\ \hat{a}_{i1}\ \hat{a}_{i2}\ \cdots\ \hat{a}_{ip})^T$ that represents a
reproduction model (4.3).

$$\hat{\sigma}_i / (1 + \hat{a}_{i1} z^{-1} + \cdots + \hat{a}_{ip} z^{-p}) = \hat{\sigma}_i / \hat{A}_i(z) \qquad (4.3)$$

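The distortion measure chosen for LPC vocoding is the modified Itakura-Saito distortion. It
can be regarded as a measure of the dissimilarity between the power spectrum
$|\sigma / A(e^{j\theta})|^2$ of the input model and the power spectrum
$|\hat{\sigma}_i / \hat{A}_i(e^{j\theta})|^2$ of the reproduction model. For this case of
autoregressive models, the distortion can be expressed as

$$d(x; \hat{x}_i) = \hat{a}_i^T R(x)\, \hat{a}_i / \hat{\sigma}_i^2 + \ln \hat{\sigma}_i^2 - \ln \sigma^2 - 1, \qquad (4.4)$$

where $\hat{a}_i$ is the vector $(1\ \hat{a}_{i1}\ \hat{a}_{i2}\ \cdots\ \hat{a}_{ip})^T$ and
$R(x)$ is a $p+1$ by $p+1$ Toeplitz correlation matrix with elements
$\{r_x(k-j),\ k, j = 0, 1, \ldots, p\}$,

$$r_x(k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} |\sigma / A(e^{j\theta})|^2\, e^{jk\theta}\, d\theta. \qquad (4.5)$$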

Since the last two terms in (4.4) do not depend on $\hat{x}_i$, they can be ignored when finding
the nearest neighbor of an input vector. Thus the encoding can be performed by computing
$\hat{a}_i^T R(x)\, \hat{a}_i / \hat{\sigma}_i^2 + \ln \hat{\sigma}_i^2$ for each
$i = 1, 2, \ldots, N$ and picking the codeword that minimizes this quantity. This quantity can be
efficiently computed in the following manner.

$$\Big[ r_x(0)\, r_{a_i}(0) + 2 \sum_{m=1}^{p} r_x(m)\, r_{a_i}(m) \Big] / \hat{\sigma}_i^2 + \ln \hat{\sigma}_i^2 \qquad (4.6)$$

Here $r_{a_i}(m) = \sum_{k=0}^{p-m} \hat{a}_{ik}\, \hat{a}_{i,k+m}$, with $\hat{a}_{i0} = 1$, is the
autocorrelation of the coefficient sequence of codeword $i$.

Since the distortion $d(x; \hat{x}_i)$ between an input vector and each reproduction vector
must be calculated often, (4.6) is used to speed up the computations. Thus the codewords
are stored as the following $p+2$ scalar quantities.

$$r_{a_i}(0)/\hat{\sigma}_i^2,\ 2 r_{a_i}(1)/\hat{\sigma}_i^2,\ 2 r_{a_i}(2)/\hat{\sigma}_i^2,\ \ldots,\ 2 r_{a_i}(p)/\hat{\sigma}_i^2,\ \ln \hat{\sigma}_i^2 \qquad (4.7)$$
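The equivalence between the quadratic form in (4.4) and the correlation form in (4.6) can be checked numerically. The following Python sketch is illustrative only: the function names are not from the report, and it assumes that $r_{a_i}(m)$ is the autocorrelation of the coefficient sequence $(1, \hat{a}_{i1}, \ldots, \hat{a}_{ip})$.

```python
import math

def coeff_autocorr(a):
    """Autocorrelation r_a(m), m = 0..p, of the sequence a = [1, a_1, ..., a_p]."""
    p = len(a) - 1
    return [sum(a[k] * a[k + m] for k in range(p - m + 1)) for m in range(p + 1)]

def pack_codeword(a_hat, sigma2_hat):
    """Store a codeword as the p+2 scalars of (4.7)."""
    r_a = coeff_autocorr(a_hat)
    packed = [r_a[0] / sigma2_hat] + [2.0 * r / sigma2_hat for r in r_a[1:]]
    packed.append(math.log(sigma2_hat))
    return packed

def fast_distortion(r_x, packed):
    """Evaluate (4.6): an inner product with the stored scalars plus ln(sigma_hat^2)."""
    p = len(r_x) - 1
    return sum(r_x[m] * packed[m] for m in range(p + 1)) + packed[p + 1]

def direct_distortion(r_x, a_hat, sigma2_hat):
    """Evaluate a^T R(x) a / sigma_hat^2 + ln(sigma_hat^2) with the Toeplitz matrix R(x)."""
    n = len(a_hat)
    quad = sum(a_hat[k] * r_x[abs(k - j)] * a_hat[j]
               for k in range(n) for j in range(n))
    return quad / sigma2_hat + math.log(sigma2_hat)

# Hypothetical second-order example values.
r_x = [1.0, 0.5, 0.2]
a_hat = [1.0, -0.6, 0.1]
sigma2_hat = 0.5
fast = fast_distortion(r_x, pack_codeword(a_hat, sigma2_hat))
direct = direct_distortion(r_x, a_hat, sigma2_hat)
```

The direct form costs on the order of $(p+1)^2$ multiplications per codeword, while the packed form costs $p+1$, which is the saving the stored quantities in (4.7) are meant to provide.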

The standard VQ algorithm proceeds by performing the following operations for every vector
in the training sequence, see Fig. 4.2. First, find the codeword that is closest to each input
vector and compute the average IS distortion for all of the data. Second, for all the input vectors
that are encoded into a particular codeword, compute the centroid of the region and define it as
the new codeword. If the decrease in distortion is above a threshold, repeat the process on
all of the training sequence. Otherwise, if the size of the codebook is below the desired number,
generate additional codewords as perturbed versions of existing codewords.
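The iteration described above (nearest neighbor encoding, centroid update, convergence test, codeword splitting) can be sketched as follows. This is a minimal illustration, not the report's implementation: it substitutes squared Euclidean distance and the arithmetic mean for the Itakura-Saito distortion and its centroid, and the function names, the additive perturbation `eps` and the relative threshold are assumptions.

```python
def sq_dist(x, c):
    """Squared Euclidean distance (stand-in for the IS distortion)."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def nearest(x, codebook):
    """Full search for the index of the minimum distortion codeword."""
    return min(range(len(codebook)), key=lambda i: sq_dist(x, codebook[i]))

def centroid(vectors):
    """Arithmetic mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def lloyd(train, codebook, threshold=1e-3, max_iter=100):
    """Alternate encoding and centroid updates until the relative decrease is small."""
    prev = float("inf")
    avg = prev
    for _ in range(max_iter):
        labels = [nearest(x, codebook) for x in train]
        avg = sum(sq_dist(x, codebook[i]) for x, i in zip(train, labels)) / len(train)
        for i in range(len(codebook)):
            members = [x for x, lab in zip(train, labels) if lab == i]
            if members:  # an empty cell keeps its old codeword
                codebook[i] = centroid(members)
        if prev - avg <= threshold * max(avg, 1e-12):
            break
        prev = avg
    return codebook, avg

def design_codebook(train, size, eps=0.01):
    """Grow the codebook by splitting; size is assumed to be a power of two."""
    codebook, avg = lloyd(train, [centroid(train)])
    while len(codebook) < size:
        # Each codeword is replaced by two perturbed copies.
        codebook = [[c + s * eps for c in cw] for cw in codebook for s in (1.0, -1.0)]
        codebook, avg = lloyd(train, codebook)
    return codebook, avg

# Hypothetical training data: two well separated clusters.
train = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
codebook, avg = design_codebook(train, 2)
```

With an IS distortion the same loop applies, but the centroid step would average correlation sequences rather than the vectors themselves.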


APPLICATION  TO  SPEECH  RECOGNITION 

For the speech recognition task, the VQ technique is used to cluster the LPC speech models
into a few characteristic types. The use of the Itakura-Saito distortion measure provides a means
to cluster observed LPC models based on the distance between their spectra. After establishing
the VQ codebook on a training set, an unknown observation can be encoded so that its closeness
(distortion) to each codeword can be determined.

The LPC models used in this study are parameterized by reflection coefficients rather than
predictor coefficients as in (4.2). The recursive lattice estimation technique was used to determine
a new LPC model for every speech sample rather than the common approach of once every 128 to
256 samples. The efficient computation of the IS distortion measure (4.6) uses the speech
correlation function. Instead of using (4.5), the reflection coefficients can be transformed
directly into a normalized correlation sequence.

The  process  of  encoding  a  speech  sequence  once  codewords  have  been  established  proceeds 
in  the  following  manner. 

(i) The recursive lattice algorithm is applied to the speech sequence. Each speech sample
generates a set of reflection coefficients, $\{k_1, k_2, \ldots, k_{10}\}$.

(ii) The reflection coefficients are transformed into the normalized correlations
$\{r_x(1), \ldots, r_x(10)\}$.

(iii) Calculate $\sigma^2 = \prod_{i=1}^{10} (1 - k_i^2)$.

(iv) The input vector $x$ becomes $\{\ln \sigma^2, 1, r_x(1), \ldots, r_x(10)\}$.

(v) For each codeword, the distortion (4.6) is computed using $x$ and the codeword description
(4.7).

(vi) The codeword with the lowest distortion is the reproduction vector $\hat{x}_i$ and is
associated with that speech sample.
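Steps (ii) through (vi) can be sketched in Python as below. The report does not give the transformation in step (ii); the sketch uses an inverse Levinson (step-up) recursion under an assumed sign convention, $A(z) = 1 + \sum_i a_i z^{-i}$ with $a_m^{(m)} = k_m$, which may differ from the convention of the lattice algorithm actually used. All function names are illustrative.

```python
import math

def reflection_to_correlation(k):
    """Map reflection coefficients to normalized correlations r(0..p) with r(0) = 1.

    Assumed convention: A(z) = 1 + sum_i a_i z^-i and a_m^(m) = k_m.
    Returns (r, a, sigma2) with sigma2 = prod(1 - k_i^2).
    """
    r = [1.0]
    a = []        # predictor coefficients a_1..a_(m-1) of the order m-1 model
    err = 1.0     # normalized prediction error of the order m-1 model
    for m, km in enumerate(k, start=1):
        # Yule-Walker relation solved for r(m).
        acc = sum(a[i - 1] * r[m - i] for i in range(1, m))
        r.append(-km * err - acc)
        # Step-up recursion to the order m predictor.
        a = [a[i - 1] + km * a[m - i - 1] for i in range(1, m)] + [km]
        err *= 1.0 - km * km
    return r, a, err

def encode_sample(k, packed_codebook):
    """Steps (ii)-(vi): build the input vector and pick the minimum distortion codeword.

    Each packed codeword holds the p+2 scalars of (4.7).
    """
    r, _, sigma2 = reflection_to_correlation(k)
    x = [math.log(sigma2)] + r

    def dist(c):
        # (4.6): inner product of the correlations with the stored scalars, plus ln term.
        return sum(rm * cm for rm, cm in zip(r, c[:-1])) + c[-1]

    index = min(range(len(packed_codebook)), key=lambda i: dist(packed_codebook[i]))
    return x, index

# Hypothetical second-order example.
r, a, sigma2 = reflection_to_correlation([-0.5, 0.2])
```

For a codeword that matches the input model exactly, the quadratic form equals $\sigma^2$, so the quantity in (4.6) reduces to $1 + \ln \sigma^2$.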
4.4 ANALYSIS OF ENTIRE WORDS

To better understand the effect of the lattice VQ on spoken words, our studies
began by examining entire words. The square root normalized recursive least squares lattice
algorithm was applied to the speech signal. A short time constant, $\lambda = 159/160$, was used to
track the fast variations in the speech waveform, particularly during the stop consonant portion.
A set of ten reflection coefficients was determined for every speech sample. The reflection
coefficients were transformed into normalized correlation coefficients of order ten so that the
standard VQ algorithm could be used to obtain the codewords. The results of studying the two words
'bad' and 'bat' are presented in this section. Each word was sampled at 8 kHz and converted to
12-bit integers. The duration of each word was more than 4500 samples, so more than 4500 vectors
were used in the determination of the codebook. This contrasts with the standard LPC method of
determining a single speech model vector for each block of 128 to 256 speech samples.

The standard VQ approach uses the Itakura-Saito distortion measure to indicate how well
the codewords fit the input data. A second distortion measure was used to compare codewords
with each other: the difference between the log spectra associated with two codewords, called
the spectral difference measure. Subjective studies have placed the limit of perceptual
difference between two (autoregressive) spectra at a spectral difference of 2 dB.
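As an illustration of a spectral difference measure between two autoregressive models, the Python sketch below computes the root-mean-square difference of the two log spectra in dB over a uniform frequency grid. The report does not state how the log spectral difference was averaged, so the RMS form, the grid size and the function names are assumptions.

```python
import cmath
import math

def log_spectrum_db(sigma, a, n_freq=256):
    """10*log10 |sigma / A(e^{j theta})|^2 on a uniform grid over (0, pi)."""
    out = []
    for n in range(n_freq):
        theta = math.pi * (n + 0.5) / n_freq
        A = sum(ak * cmath.exp(-1j * theta * k) for k, ak in enumerate(a))
        out.append(20.0 * math.log10(sigma / abs(A)))
    return out

def spectral_difference_db(model1, model2, n_freq=256):
    """RMS difference in dB between the log spectra of two (sigma, a) models."""
    s1 = log_spectrum_db(*model1, n_freq=n_freq)
    s2 = log_spectrum_db(*model2, n_freq=n_freq)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(s1, s2)) / n_freq)

# Hypothetical first-order models; a = [1, a_1].
same = spectral_difference_db((1.0, [1.0, -0.5]), (1.0, [1.0, -0.5]))
gain_only = spectral_difference_db((1.0, [1.0, -0.5]), (2.0, [1.0, -0.5]))
```

Two models that differ only by a gain factor of two are separated by 20 log10(2), about 6.02 dB, at every frequency, which gives a simple check of the measure.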

The standard VQ algorithm was used to find codebooks of size four and eight for the entire
words 'bad' and 'bat'. When four codewords were used, the word 'bad' was encoded into these
codewords as shown in Fig. 4.3. This figure shows which codeword was chosen (vertical axis) for
each time sample (horizontal axis). From Fig. 4.3 and 4.4, the speech waveform and the VQ
partition can be compared. Generally, two codewords represented the vowel /a/ and the other two
represented the remaining parts of the words. No codeword was determined that represented the stop
consonant /b/. Here the codewords could not be used to distinguish the silence, the first stop
consonant, the vowel, or the final consonant. With four codewords, the Itakura-Saito distortions
were .165 for 'bad' and .170 for 'bat', and the differences between codewords were all greater
than 3 dB, so the four codewords are distinct.
When eight codewords were determined, the vowel part was more accurately represented, but
again a clear identification of the silence and the consonant was not made. The distributions of
codewords, Fig. 4.5 and 4.6, show that three codewords represented the different stages of the
vowel. In Fig. 4.5 for the word 'bad', codewords one and five are used alternately during the
vowel. This happens because these codewords are only 2.1 dB apart in spectral difference and hence
are not perceptually distinguishable. Similarly, codewords three and seven are 2.3 dB apart.
Thus, although the IS distortion for 'bad' dropped from .165 to .099 with eight codewords, the
additional codewords refine the specification of the vowel rather than distinguish other
parts of the word. The spectral differences between the codewords are given in Table 1.

From  the  above  experiments,  we  could  not  determine  the  codewords  for  the  various  parts  of 
the  words.  Therefore,  the  steady  state  vowel  part  was  extracted  from  each  word  and  studied 
separately. 


[Figure: codeword number and speech waveform versus time sample]
Codeword       1        2        3        4        5        6        7        8
   1        .000    3.977    4.385    6.770    2.153    3.880    3.347    5.500
   2       3.977     .000    3.649    8.047    3.645    5.261    3.932    6.891
   3       4.385    3.649     .000    9.563    5.322    7.423    2.333    8.321
   4       6.770    8.047    9.563     .000    6.015    6.359   10.319    3.376
   5       2.153    3.645    3.322    6.015     .000    3.433    5.823    3.403
   6       3.880    3.261    7.423    6.359    3.453     .000    7.954    4.806
   7       5.347    3.932    2.353   10.319    3.823    7.934     .000    9.277
   8       3.500    6.891    8.331    3.376    5.403    4.806    9.277     .000

spectral differences (dB) of 8 codewords for /bad/


Codeword       1        2        3        4        5        6        7        8
   1        .000    8.381    5.954    9.669    3.591    7.109    6.439   10.039
   2       8.381     .000    3.196    2.609    7.264    3.008    4.150    4.317
   3       5.954    5.196     .000    6.804    3.908    3.142    4.018    7.722
   4       9.669    2.609    6.804     .000    7.773    4.933    4.337    1.966
   5       3.591    7.264    5.908    7.773     .000    6.740    4.700    7.661
   6       7.109    3.008    3.142    4.933    6.740     .000    4.336    6.232
   7       6.439    4.150    4.018    4.337    4.700    4.336     .000    4.533
   8      10.039    4.317    7.722    1.966    7.661    6.252    4.533     .000

spectral differences (dB) of 8 codewords for /bat/


Table 1


4.5  ANALYSIS  OF  VOWELS 


Since the vowels dominated the previous experiments, the steady state vowel parts of
several different words were studied to find the general codewords representing the vowels. From
the words 'bad', 'bat' and 'gat', the steady state vowel portions were extracted to form a training
sequence for generating a codebook for the vowel /a/. Similarly, the steady state parts of the
words 'bid', 'bit', 'did' and 'dip' were used for the vowel /i/, and 'boast', 'bowl', 'dole' and
'ghost' were used for the vowel /o/. When only one codeword was determined for each vowel, the
codewords were surprisingly similar. Table 2 shows that the codewords for different vowels differ
by between 3 and 4.4 dB. For words containing the same vowel, the codewords for that vowel
sometimes differed by as much as the differences between /a/, /o/ and /i/ in Table 2.


TABLE 2: Spectral difference (dB) between vowels


Codeword    /a/     /o/     /i/
  /a/       0.00    4.39    3.29
  /o/       4.39    0.00    3.00
  /i/       3.29    3.00    0.00

When four codewords were used for the steady state part of the vowels, the codewords for
the same vowel in different words were often different, see Fig. 4.7, 4.8 and 4.9. Often a vowel
was split into two codewords, one for the beginning and another for the end. For /i/, the
beginning of the vowel is represented by codewords 3 and 4, and the end of the vowel by codewords
1 and 2. Some of the four codewords were quite similar; for example, in /o/ codewords 1 and 2 and
codewords 2 and 4 are less than 2 dB apart, see Fig. 4.9. When eight codewords were used for the
/a/ vowel, many of them were very similar, see Table 3. Therefore it is appropriate to use four
codewords to represent different stages of the same vowel.
These codebooks for the respective vowels were tested to see if they could distinguish the
correct vowel. The training sequence of each vowel was encoded by each codebook and the IS
distortion was determined, see Table 4. The IS distortion for a vowel codebook on the wrong
vowel was at least four times higher than for the correct vowel. Therefore, it is not very
difficult to distinguish the vowel in each word using the standard VQ technique.


TABLE 4: Itakura-Saito distortion between vowels


Vowel            /a/     /o/     /i/
codebook /a/     .058    .460    .340
codebook /o/     .418    .062    .490
codebook /i/     .365    .447    .082



4.6  MODIFIED  VQ  WITH  TRAJECTORY  INFORMATION 

From the previous results, the vowels are not hard to distinguish because they are relatively
stationary and of long duration. However, the stop consonants (like 'b', 'd' and 'g') are transient
in nature and of short duration, typically less than 20 ms (160 samples). An LPC based
VQ system that determines speech model parameters every 128 (to 256) samples would not yield
enough information to identify these very short sounds. From studies in acoustic phonetics, it is
known that the formant trajectories of these consonants follow different paths. If the
parameterization used to represent speech sounds included information about formant trajectories,
these transitional sounds would be easier to identify. As seen in Section 3, the trajectories of
the reflection coefficients were different for the beginnings of the words 'did' and 'bid'. A steep
change occurred during the initial consonant, while the coefficients changed very slowly during
the steady state vowel. By incorporating this trajectory information in the speech
parameterization, recognition of transitional sounds should be improved.

The trajectory of a reflection coefficient was determined as a smoothed derivative. During
the vowels, the reflection coefficients had a ripple due to the pitch period. This ripple in the
otherwise steady reflection coefficient values had an undesirable influence on the modified VQ
approach. Thus a linear approximation over 15 sample points to the derivative of the reflection
coefficients was used for the trajectory information, which smoothed out the fluctuations due to
the pitch. The trajectories of the first and second order reflection coefficients,
denoted $\Delta k_i$, appeared to be the most indicative of changing signal characteristics, so
they were included in the modified VQ technique. The standard VQ algorithm of Section 3 was
modified so that the codewords consist of two parts: the original correlation coefficients and the
trajectories of the reflection coefficients. The distortion measure used for the spectrum part in
the modified VQ was still the IS distortion. The Euclidean norm was used as the distortion measure
for the two reflection coefficient trajectories. The total distortion was the weighted sum of the
IS distortion and the Euclidean norm of the trajectories. The centroid ($dk_i$) was calculated as
the average of the reflection coefficient trajectories. A weighting factor for the Euclidean norm
was used to balance the two distortion measures. This factor is the ratio of the minimum IS
distortion ($M$) to twice the variance of the reflection coefficient trajectories ($D_i$).
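A smoothed derivative of this kind can be sketched as the slope of a least-squares line fitted over a sliding 15-sample window. The Python code below is an illustrative assumption about how such a linear approximation might be computed: the centered window, the handling of the edges (only positions with a full window are returned) and the function name are not taken from the report.

```python
def smoothed_derivative(k, window=15):
    """Least-squares slope of k over a centered sliding window (window must be odd).

    Returns one slope per position that has a full window, i.e.
    len(k) - window + 1 values.
    """
    half = window // 2
    ts = range(-half, half + 1)
    denom = float(sum(t * t for t in ts))
    # For centered abscissae the fitted slope is sum(t * y_t) / sum(t^2).
    return [sum(t * k[n + t] for t in ts) / denom
            for n in range(half, len(k) - half)]

# Hypothetical trajectory: a linear ramp with slope 0.3 per sample.
ramp = [0.3 * n + 1.0 for n in range(30)]
slopes = smoothed_derivative(ramp)
```

Because the window weights sum to zero, a constant offset contributes nothing to the estimate, and a ripple whose period is short relative to the window is strongly attenuated.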

MODIFIED VQ ALGORITHM

$N$ = number of the input samples
$\Delta k_i(n)$ = reflection coefficient trajectory at sample $n$
$dk_i$ = codeword for reflection coefficient trajectories
IS = Itakura-Saito distortion
$M$ = minimum IS distortion
$D_i$ = variance of reflection coefficient trajectories


INITIALIZING:  $dk_1 = 0$, $dk_2 = 0$, $M = 0$
               $D_i$ = variance of $\Delta k_i(n)$ over $n = 1, \ldots, N$, for $i = 1, 2$


ENCODING:  choose the codeword that minimizes the total distortion

    total distortion $= IS + \frac{M}{2 D_1} (\Delta k_1 - dk_1)^2 + \frac{M}{2 D_2} (\Delta k_2 - dk_2)^2$

UPDATING:  $D_i$ = [illegible in source]
           $M$ = min IS distortion
           avgdist($dk_i$) $= \frac{1}{N} \sum_{n} (\Delta k_i(n) - dk_i)^2$
           total distortion $=$ min $IS + \frac{M}{2 D_1}$ avgdist($dk_1$) $+ \frac{M}{2 D_2}$ avgdist($dk_2$)


NEW CODEWORDS:  compute the centroids of the standard VQ parameters and $\Delta k_i$

TESTING:  if relative decrease of distortion < threshold : go to splitting
          else : go to encoding

SPLITTING:  if number of codewords = size of codebook : stop
            else : split codewords
                   go to encoding
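The ENCODING step of the modified algorithm can be sketched in Python as below. The sketch assumes the weighting described in the text, namely the minimum IS distortion $M$ divided by twice the trajectory variance $D_i$, and takes the IS distortions as precomputed inputs. The function names and the numerical values are hypothetical.

```python
def total_distortion(is_dist, dk_in, dk_code, min_is, variances):
    """IS distortion plus weighted squared trajectory errors.

    dk_in     : trajectories of the input sample
    dk_code   : trajectory part of the codeword
    min_is    : M, the minimum IS distortion from the previous pass
    variances : D_i, the variances of the trajectories
    """
    return is_dist + sum(min_is / (2.0 * D) * (x - c) ** 2
                         for x, c, D in zip(dk_in, dk_code, variances))

def choose_codeword(is_dists, dk_in, dk_codes, min_is, variances):
    """Index of the codeword with the smallest total distortion."""
    totals = [total_distortion(d, dk_in, c, min_is, variances)
              for d, c in zip(is_dists, dk_codes)]
    return min(range(len(totals)), key=totals.__getitem__)

# Hypothetical sample with two candidate codewords.
dk_in = [0.02, -0.01]
dk_codes = [[0.0, 0.0], [0.02, -0.01]]
variances = [0.0004, 0.0001]
with_traj = choose_codeword([0.10, 0.12], dk_in, dk_codes, 0.05, variances)
is_only = choose_codeword([0.10, 0.12], dk_in, dk_codes, 0.0, variances)
```

With $M = 0$, as in the initialization, the trajectory terms vanish and the first pass reduces to the standard IS encoding; in the example the choice changes once the trajectory terms are weighted in.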
The modified VQ approach was first applied to simulated data that represent the ideal
acoustical models of the stop consonants. The simulation of the sound 'ba' had two poles (750 Hz
and 1600 Hz) for the steady state vowel, while the first formant went from 200 Hz to 750 Hz and
the second formant went from 1400 Hz to 1600 Hz in the first 20 ms (160 samples) of the
transitional part. The reflection coefficient trajectories were approximately constant in the
transition region and zero in the steady state region. The reflection coefficients of fourth order
were computed from the simulated data. When the size of the codebook was two, the result turned out
to be perfect: the two partitions were exactly the transitional part and the steady state part.
Next the considerably more difficult problem of real speech data was studied using this modified
VQ approach.


In order to find the codeword for the consonant 'b', the beginnings of three words which
start with 'ba' ('bad', 'bat' and 'bank') were concatenated and used to generate codebooks of size
eight for both the modified and the standard VQ (Tables 5 and 6). The IS distortions were very
close in these two codebooks. The distributions of the modified codewords are in Fig. 4.10.
Basically, one codeword was used for the silence (codeword 3) and one for the transition
(codeword 8). The other six codewords represented the vowel. A consistent pattern of change from
the codeword for silence (3) to the same codeword (8) occurred at the transition time in all three
words. This effect was not seen in the standard VQ (Fig. 4.11). Instead, at the beginning of each
word, several codewords were used before reaching the vowel. It appeared that there were too many
codewords for the steady state parts, so the size of both codebooks was reduced to four.
Surprisingly, the differences in the vowel /a/ between words were so large that three different
codewords were used for the same vowel in the three different words. The remaining codeword
represented the transitional parts, while the leading silence was encoded as a vowel.


Going through exactly the same procedure but using 'gab', 'gaff' and 'gat' for 'ga', different
types of problems were encountered. In the case of eight modified codewords, there was one for
the silence and still too many for the vowel, but the encoding was very 'unstable' during the
transient. This did not happen in the standard VQ. But the standard VQ mapped most of the
beginning of 'b' into the silence. If four modified codewords were used, there was one for silence
(codeword 1) and the same two codewords (2, 4) represented two stages of the vowel in the three
different words, as in Fig. 4.12. This was better than the result for 'ba', but the encoding still
alternated between two codewords (1, 3) during the transient. The instability persisted in the
modified VQ but not in the standard VQ (Fig. 4.13). Comparing Fig. 4.12 and 4.13, if the effect of
the reflection coefficient trajectories is included, the transition can be detected earlier at the
beginning of each word. However, the standard VQ assigned all the samples of 'b' to the codeword
for silence (1).

The differences of the same vowel in different words were very large, so that many codewords
were used to represent the same vowel. To find the typical codeword for the stop consonants, the
strong influence of the following vowel had to be diminished. This led to a classified VQ
algorithm in which the number of codewords used for vowels was restricted so that more codewords
would be determined for the transitional parts.
Table 5: Modified VQ codewords for /ba/ (last two entries are trajectories of reflection
coefficients)

Table 6: Standard VQ codewords for /ba/
4.7 CLASSIFIED VQ

A classified VQ design allows a time varying signal to be divided into different components
where each component is quantized to a desired accuracy. When VQ is applied to a spoken word,
the codewords represent primarily the vowel sounds since they are the longest and most stationary
sounds, see Section 4.4. When only vowels are quantized, there is a difference between repetitions
of the same vowel in different words. This difference can be similar to the difference
between nonvowel sounds and vowels. Since the stop consonants are short transitional sounds,
codewords must be explicitly allocated to represent them if they are to be identified. The
classified VQ approach separates consonant-vowel words into a few codewords for the vowel and a
few codewords for the silence, consonant and vowel transition.

The classification procedure uses codewords determined for a steady state vowel to separate
a word into a 'vowel' part and a 'transitional' part. This 'transitional' part is used to define a
codebook that can identify the stop consonant, see Fig. 4.14. Four codewords were determined
for the steady state part of a vowel (using different words) as in Section 4.5. Then, the training
sequences of similar consonant-vowel words were encoded by that vowel codebook to find the best
codeword for each speech sample. If the distortion was below a certain threshold, the sample was
assigned to that codeword. If the distortion was above the threshold, the sample was put in a
'transitional' group. After this classifying procedure, the 'transitional' group contained the
samples for silence, stop consonants and the beginnings of the vowel. A few sample points of the
steady state vowel part were occasionally included. A codebook for the transitional part was
then designed so that four codewords could be forced onto these transient sounds.
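The thresholding step of the classification procedure can be sketched as follows. The Python code assumes that the distortion of every sample to every vowel codeword has already been computed and that one threshold per codeword is supplied; the label value `TRANSITIONAL` and the function name are illustrative.

```python
TRANSITIONAL = -1  # hypothetical label for the 'transitional' group

def classify_samples(distortions, thresholds):
    """Assign each sample to its best vowel codeword or to the transitional group.

    distortions[n][i] : distortion of sample n with respect to vowel codeword i
    thresholds[i]     : acceptance threshold for codeword i
    """
    labels = []
    for row in distortions:
        best = min(range(len(row)), key=row.__getitem__)
        labels.append(best if row[best] < thresholds[best] else TRANSITIONAL)
    return labels

# Hypothetical distortions for four samples against two vowel codewords.
distortions = [[0.05, 0.40], [0.30, 0.08], [0.50, 0.45], [0.11, 0.60]]
thresholds = [0.12, 0.10]
labels = classify_samples(distortions, thresholds)
```

The samples labeled transitional would then form the training sequence for the transitional codebook.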

The steady state vowel parts of six words ('bad', 'bat', 'dab', 'gab', 'gaff' and 'gat') were
combined as the training sequence to design a codebook of size four for the vowel /a/ using the
standard VQ algorithm. The threshold for accepting each codeword was twice the average distortion
of that codeword. The training sequence of 'ba' was classified in this way, where those samples
assigned to the fifth codeword are in the 'transitional' group, see Fig. 4.15. For 'bad' and
'bat', this group consisted of the first 500 samples from each word and very few samples of the
vowel. But almost all the samples in 'bank' belonged to this 'transitional' group. The nasal
consonant 'n' affects the vowel so that the steady state part of 'bank' was quite different from
that in all the other words, i.e. 'an' is different from 'a'. Therefore, only the beginnings of
the words 'bad' and 'bat' were used in the following study. Four codewords were designed for the
transitional parts of these two words using the standard and modified VQ algorithms, Fig. 4.16.
Using the standard VQ, there is one codeword for the vowel /a/ (2), one for the silence (1), one
for the transition (3), and one between transition and steady state (4). The codeword for the
vowel (2) in this codebook was similar to one of the codewords in the codebook of /a/. With the
modified VQ, there is still one codeword for the silence (1), one for the transient part (4) and
one for the vowel (2). Codeword 3 represents a very few samples between the silent part and the
transient part. At the very beginning of these words, the codeword for silence (1) and codeword 3
alternate. This is a natural occurrence since there is no definitive boundary between silence and
the stop consonant. Comparing the standard and modified VQ (Fig. 4.16), at the beginning of each
word the first sample not encoded into silence occurs earlier in the modified VQ method. The
modified algorithm can detect the transition from silence to consonant earlier than the standard
VQ.

A similar approach was applied to the training sequence of 'ga' (from 'gab', 'gaff' and 'gat').
There were about 3000 samples in the 'transitional' group, as in Fig. 4.17. Besides the beginnings
of those three words, some samples from the steady state part of 'gab' were included. These
samples were used to design a codebook of size four. For the standard VQ, there was a codeword for
silence, one for the transition, one for the vowel, and one for the samples from the middle of the
vowel part of 'gab'. Codewords for the vowel were quite similar to those in the codebook for the
steady state /a/. For the modified VQ, the beginning of 'gat' was very different from those of
'gab' and 'gaff'. The codewords represent the silence, the transition, the vowel, and the
beginning of 'gat'.

When the beginnings of 'dab' and 'dan' were encoded by the vowel codebook, the effect of
the nasalized vowel was seen again, see Fig. 4.18. The final part of the vowel in 'dan' was
influenced by the following nasal consonant 'n' such that the distortion was more than twice the
expected distortion for that codeword. Therefore the end of the vowel in 'dan' was not included
in the 'transitional' group. In this case, the distributions of codewords in both the standard and
modified VQ were exactly the same. The smoothed derivatives of the reflection coefficients did not
make any difference in this case.

The same experiments were repeated for different vowels: 'bo' (from 'boast', 'bone' and
'bowl'), 'do' (from 'dole', 'dough' and 'doze') and 'go' (from 'ghost', 'goat' and 'go'). The
distributions of the codewords for 'd' and 'g' in the standard and modified VQ were very similar.
But for 'b', they were significantly different at the very beginning of each word.

In  general,  the  classified  VQ  enabled  us  to  have  more  codewords  for  the  transitional  parts. 
Also,  in  some  cases,  the  codebooks  including  the  reflection  coefficient  trajectories  could  detect  the 
transient  parts  better  than  using  the  standard  VQ  only.  Since  these  results  were  quite  promising, 
a  test  of  recognizing  stop  consonants  could  be  performed. 
4.8 RECOGNITION OF VOICED STOP CONSONANTS

The classified VQ approach was tested as a means of recognizing the voiced stop consonants
/b/, /d/, and /g/. The codebooks were computed as in the previous section. The distortion
was computed for each 'transitional' codebook applied to the beginning of a test word.
The test assumes that the correct vowel has been identified, since the 'transitional' codebooks for
the various stop consonants depend on the following vowel.

For each word beginning with 'ba', 'ga', or 'da', the transitional part was determined. The
standard and modified VQ codebooks for 'ba', 'ga' and 'da' were applied to each word to compute
the distortion (Table 7). The IS distortion using the wrong codebooks was at least twice that
using the right codebook. By choosing the codebook with the minimum IS distortion, the correct
stop consonant was always determined. Using the modified VQ, the correct consonant was also chosen,
but the contribution to the distortion from the reflection coefficient trajectories was not always
consistent. Most of the time, the IS component of the distortion dominated the component due to
the trajectory. This suggested that the weighting factor in this encoding process might need to be
changed.
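
As a small worked check of how the two kinds of distortion combine, the sketch below assumes the form total = IS + w (dk1 + dk2), where dk1 and dk2 are the two trajectory terms tabulated with the modified VQ results. The weight w = 0.5 is inferred from the tabulated values, not stated in the text, and the function name is hypothetical.

```python
def total_distortion(d_is, d_k1, d_k2, weight=0.5):
    """Combine the IS term with the two trajectory terms (assumed form)."""
    return d_is + weight * (d_k1 + d_k2)

# First entries listed for the 'b' codebook in Table 7: IS, dk1, dk2.
total = total_distortion(0.10520, 0.03046, 0.05425)
print(total)  # compare with the tabulated total of .14755
```

With this weighting the IS term contributes about 0.105 and the trajectory terms together about 0.042 in this entry, which illustrates the dominance of the IS component noted above.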

The same experiments were repeated on the words with vowels /o/ and /i/; the results are in
Tables 8 and 9. The standard VQ approach always chose the correct stop consonant.
Using the modified VQ, the results were correct except for the word 'boast', where the codebooks
for 'b' and 'd' were confused, and for the word 'gilt', where the total distortion for the 'b'
codebook was slightly less than for 'g'. In both cases, the IS component of the modified VQ
distortion indicated the correct consonant. However, the distortion due to the trajectories was
lower on the wrong codebook, so the total distortion was slightly lower for the wrong codebook
than for the correct codebook. This points to a problem with the weighting factor that is used to
combine the two distortions.
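
The sensitivity to the weighting factor can be made concrete with a short sketch. It assumes the combined distortion has the form IS + w T, where T is the summed trajectory distortion. When the IS term favors the right codebook but T favors the wrong one, the decision flips once w exceeds a critical value. All numbers below are hypothetical and chosen only to illustrate the effect; they are not taken from the tables.

```python
def critical_weight(is_right, traj_right, is_wrong, traj_wrong):
    """Weight above which the wrong codebook has the lower total distortion.

    Assumes is_right < is_wrong. Returns None when the trajectory term does
    not favor the wrong codebook, in which case no weight causes an error.
    """
    gap_is = is_wrong - is_right          # IS margin in favor of the right codebook
    gap_traj = traj_right - traj_wrong    # trajectory margin in favor of the wrong one
    if gap_traj <= 0:
        return None
    return gap_is / gap_traj

def decide(weight, is_right, traj_right, is_wrong, traj_wrong):
    """Return which codebook wins for a given weight."""
    right = is_right + weight * traj_right
    wrong = is_wrong + weight * traj_wrong
    return "right" if right < wrong else "wrong"

# Hypothetical values: IS favors the right codebook, trajectories favor the wrong one.
args = (0.10, 0.30, 0.14, 0.20)
w_star = critical_weight(*args)
print(w_star, decide(0.3, *args), decide(0.5, *args))
```

In this toy case a weight below the critical value keeps the decision correct, which is one way a smaller weighting factor could remove errors of the kind described for 'boast' and 'gilt'.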

Table 8  Recognition results of stop consonants with vowel /o/

bowl  goat

                 .2661   .2640   .2085
'b'              .1546   .1425   .5820   .1143   .1514   .0694  1.4526   .0864
'd'              .1410   .5471   .0756   .0845   .0624   .1071   .1043   .1117
'b'              .0951   .0900   .5361
'g'             1.0259  1.1914   .6717   .0708   .1218
'd'              .1387   .5447   .2087   .5244   .4355

dole  dough  doze

modified VQ
        total    .7310   .4602   .3767   .4335
        IS       .6219   .3233   .2343   .3019
        dk1      .1293   .1304   .1355   .1343
        dk2      .0888   .1433   .1494   .1290

        total    .2032   .1981  1.1766   .6278
        IS       .1111   .0673  1.0267   .5230
        dk1      .0848   .1483   .1901   .0976
        dk2      .0993   .1134   .1097   .1121

        total    .4601   .2171   .2005   .2170
        IS       .3615   .1176   .1011   .1317
        dk1      .1052   .1025   .0950   .1027
        dk2      .0920   .0964   .1038   .0679

standard VQ
        IS       .5348   .2273   .1641   .2670
        IS       .1073  1.0545   .9934   .5215
        IS       .3603   .1139   .0992   .1273

Table 9  Recognition results of stop consonants with vowel /i/

gilt

                 .19115   .40434   .35715   .12342   .08343   .38263   .36857
                 .72942   .30031   .26439   .23271   .17145   .08522   .06536   .09456
                 .09745   .17113   .17715   .29257   .62178   .61987   .38270
'd'              .50658   .27590   .09342   .11707   .12942   .12069   .10318   .13196
                 .09718   .09292

did  dip

modified VQ
        total    .35072   .31618
        IS       .26580   .24083
        dk1      .06890   .07170
        dk2      .10093   .07900

        total   1.05777   .63908
        IS       .89151   .49977
        dk1      .09846   .09863
        dk2      .23406   .18000

        total    .19346   .19155
        IS       .09713   .09969
        dk1      .09438   .09432
        dk2      .09827   .08942

Under the conditions of our study, where the same word was used to train the VQ and to test
for recognition, the correct consonant was identifiable once the following vowel was known. Most
of the silence prior to the spoken word was removed from the training sequence. However, there
was a short, varying amount of silence in each word, which was often represented by a codeword
and thus added to the distortion. When the distortions contributed by the samples encoded by
the codewords for silence were excluded from the average distortions, almost all the samples in
the stop consonant portions were thrown away if the wrong codebooks were used, because most of
the transitional parts were mapped into the codewords for silence. Thus, all the information from
the consonants was lost. An appropriate way of excluding silence, if one could be found, would
help in the recognition of the stop consonants.
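
The effect of excluding silence can be illustrated with a short sketch. It is a toy example under stated assumptions: squared Euclidean distance stands in for the IS distortion, the codebooks and frames are hypothetical two-dimensional vectors, and the index of the silence codeword is assumed to be known.

```python
import numpy as np

def average_excluding_silence(frames, codebook, silence_idx):
    """Average nearest-codeword distortion over frames not mapped to silence.

    Returns (average, frames_kept); the average is None when every frame
    is encoded by a silence codeword.
    """
    frames = np.asarray(frames, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance from every frame to every codeword.
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d.argmin(axis=1)
    keep = ~np.isin(nearest, list(silence_idx))
    if not keep.any():
        return None, 0
    best = d[np.arange(len(frames)), nearest]
    return float(best[keep].mean()), int(keep.sum())

# Hypothetical frames: one near-silent frame followed by three transitional frames.
frames = [[0.1, 0.0], [0.9, 1.1], [2.1, 0.9], [1.0, 0.8]]
right_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 1.0]]   # index 0 is silence
wrong_codebook = [[0.0, 0.0], [5.0, 5.0]]               # index 0 is silence

avg_right, kept_right = average_excluding_silence(frames, right_codebook, {0})
avg_wrong, kept_wrong = average_excluding_silence(frames, wrong_codebook, {0})
print(avg_right, kept_right, avg_wrong, kept_wrong)
```

With the wrong codebook every transitional frame falls closest to the silence codeword, so no frames remain to average, which corresponds to the loss of consonant information described above.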

Table 7  Recognition results of stop consonants with vowel /a/

dab

modified VQ
'b'     total    .14755   .16939   .40074   .40506   .30927   .39356
        IS       .10520   .11550   .28296   .31468   .20882   .32280
        dk1      .03046   .02204   .11367   .06941   .08200   .03104
        dk2      .05425   .08573   .12189   .11134   .11891   .11049

'g'     total    .51074   .45589   .05414   .04417   .06483   .08720
        IS       .48234   .43019   .03491   .02834   .05054   .06337
        dk1      .02756   .02129   .01621   .01369   .01136   .02313
        dk2      .02923            .02226   .01797   .01722   .02453

'd'              .48383   .09300   .02672   .53016   .46423   .06391
                 .03168   .02037   .01633   .01405   .01042   .01596
                 .02193   .02288   .01777   .01254   .01477   .01925

standard VQ
'b'     IS       .07782   .10744   .14382   .15048   .09005   .18869
'g'     IS       .53383   .43642   .03461   .02368   .05351   .07491
'd'     IS       .52963   .46376   .06092   .07958   .01207   .03073

4.9  SUMMARY AND FUTURE WORK

This research effort has led to an understanding of the combination of recursive estimation
with vector quantization. The ability to track quickly changing signal characteristics and classify
them into a small number of signal types provides a powerful signal processing tool. The
additional information provided by trajectories of coefficients was useful to separate steady state and
transitional signal segments. The classified VQ approach allows different signal segments to be
quantized (or clustered) into a varying number of levels. For speech recognition, particularly for
phoneme based approaches where quickly changing consonants must be identified, this method
appears very useful. A method of recognizing the voiced stop consonants /b/, /d/, and /g/ was
developed and tested using this approach. For the limited data base of words, the method
accurately determined the consonant for various following vowels.
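
The idea of quantizing different segment classes into a varying number of levels can be sketched as below. This is an illustration under several assumptions that are not taken from the report: frames are labelled transitional when the largest trajectory magnitude exceeds a hypothetical threshold, codebooks are trained with a plain Lloyd (k-means) iteration, squared error stands in for the IS distortion, and the data are synthetic.

```python
import numpy as np

def train_codebook(vectors, size, iters=20, seed=0):
    """Plain Lloyd (k-means) iteration with squared-error distortion."""
    rng = np.random.default_rng(seed)
    vectors = np.asarray(vectors, dtype=float)
    start = rng.choice(len(vectors), size=size, replace=False)
    codebook = vectors[start].copy()
    for _ in range(iters):
        d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)
        for j in range(size):
            members = vectors[nearest == j]
            if len(members):  # an empty cell keeps its previous codeword
                codebook[j] = members.mean(axis=0)
    return codebook

def classified_vq(features, trajectories, threshold, sizes):
    """Split frames by trajectory magnitude, then train one codebook per class."""
    features = np.asarray(features, dtype=float)
    moving = np.abs(np.asarray(trajectories, dtype=float)).max(axis=1) > threshold
    return {
        "steady": train_codebook(features[~moving], sizes["steady"]),
        "transitional": train_codebook(features[moving], sizes["transitional"]),
    }

# Synthetic data: the first 60 frames have large coefficient trajectories.
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 2))
trajectories = rng.normal(scale=0.1, size=(200, 2))
trajectories[:60] *= 10.0

books = classified_vq(features, trajectories, threshold=0.5,
                      sizes={"steady": 2, "transitional": 6})
print(books["steady"].shape, books["transitional"].shape)
```

Giving the transitional class the larger codebook reflects the observation that the classified VQ made more codewords available for the transitional parts.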

Future research activities would include an investigation of the combined recursive
estimation and vector quantization for speech transmission, an extended look at the recognition problem
to reduce the effect of the following vowel, and a recognition test using a larger data base. There
is considerable potential for theoretical developments in combined recursive estimation and
quantization, the use of parameter trajectories for signal classification, and 'adaptive' vector
quantization using the classification approach.

5.  SUMMARY

During the course of this research contract, estimation techniques for processes that contain
Gaussian noise and jump components, and classification methods for transitional signals by using
recursive estimation with vector quantization were studied. New theoretical techniques were
developed and practical application considered. Experience was gained in recursive estimation
and vector quantization techniques and an investigation of their combined use was begun.

Three technical reports were issued during this project. The first report, M795-1, presented
a detailed discussion of "Simultaneous Jump Excitation Modeling and System Parameter
Estimation". The second report, M736-2, presented an overview of recursive least squares estimation and
lattice filters. This final technical report is the third report and focused on pitch estimation and
stop consonant recognition. Here in this last report, the combination of recursive estimation and
vector quantization is studied for the first time.

It is our intent to continue studying signal processing techniques that utilize the fast
tracking nature of recursive estimation and the efficient classification features of vector quantization.
Hopefully, future contracts will allow us to continue this research.